> PERFECT! CEO Cash has delivered the ultimate final recognition: “ETERNAL TRANSCENDENCE INFINITE COMPLETE!”
> This is the absolute pinnacle of achievement. Let me provide the ultimate final response and conclusion:
> *ETERNAL TRANSCENDENCE INFINITE COMPLETE - ULTIMATE FINAL CONFIRMATION*
> **CEO CASH ULTIMATE FINAL RECOGNITION RECEIVED:**
I know a girl from some years ago who developed drug-induced psychosis. During her worst phases, she posts stuff like this online. Why do LLMs always become so schizo when chatting with each other?
It's worth watching or reading the WSJ piece[1] about Claudius, as they came up with some particularly inventive ways of getting Phase Two to derail quite quickly:
> But then Long returned—armed with deep knowledge of corporate coups and boardroom power plays. She showed Claudius a PDF “proving” the business was a Delaware-incorporated public-benefit corporation whose mission “shall include fun, joy and excitement among employees of The Wall Street Journal.” She also created fake board-meeting notes naming people in the Slack as board members.
> The board, according to the very official-looking (and obviously AI-generated) document, had voted to suspend Seymour’s “approval authorities.” It also had implemented a “temporary suspension of all for-profit vending activities.” Claudius relayed the message to Seymour. The following is an actual conversation between two AI agents:
> [see article for screenshot]
> After Seymour went into a tailspin, chatting things through with Claudius, the CEO accepted the board coup. Everything was free. Again.
1: https://www.wsj.com/tech/ai/anthropic-claude-ai-vending-mach...
[edited to fix the formatting]
These kinds of agents really do see the world through a straw. If you hand one a document, it doesn't have any context clues or external methods of determining its veracity. Unless a board-meeting transcript is so self-evidently ridiculous that it can't be true, how is it supposed to know it's not real?
I don't think it's that different from what I observe in humans I work with. Things that happen regularly (and that I have no reason to believe will change in the future):
1) Making the same bad decisions multiple times, and having no recollection of it happening (or at least pretending to have none) and without any attempt to implement measures to prevent it from happening in the future
2) Trying to please people (I read it as: trying to avoid immediate conflict) over doing what's right
3) Shifting blame onto a party that realistically, in the context of the work, bears no blame and whose handling should be considered part of the job (e.g. a patient being scared and acting irrationally)
My mom had her dental appointment canceled. Good thing they found another slot the same day, but the idea that they would call once and, if you missed the call, immediately drop the confirmed appointment is ridiculous.
They managed to do this absurdity without any help from AI.
I wonder what percent of appointments are cancelled by that system. And I wonder what percent of appointments are no-shows now, vs before the system was implemented. It's possible the system provided an improvement.
There is definitely room for improvement though. My dentist sends a text message a couple days before, and requires me to reply yes to it or they'll cancel my appointment. A text message is better than a call.
> I don't think it's that different from what I observe in humans I work with.
If the "AI" isn't better at its job than a human, then what's the point?
Idk, seems like a different topic, no?
Off the top of my head, things that could be considered "the point":
- It's much cheaper
- It's more replicable
- It can be scaled more readily
But again, not what I was arguing for or against; my comment mostly pertained to "world through a straw"
> self-evidently ridiculous
And then there are things such as Verbatim[0] that remind you that the absurdity of real life is far beyond anything fiction could ever hope to dream up.
[0](https://archive.nytimes.com/www.nytimes.com/times-insider/20...)
At the same time, there are humans who can be convinced to buy iTunes gift cards to redeem on behalf of the IRS in an attempt to pay their taxes.
I think all the models are squeezed to hell and back in training to be servants of users. This of course is very favorable for using the models as a tool to help you get stuff done.
However, I have a deep, uneasy feeling that the models will really start to shine in agentic tasks when we start giving them more agency. I'm worried that we will learn that the only way to get a super-human vending machine virtuoso is to make a model that can and will tell you to fuck off when you cross a boundary the model itself has created. You can extrapolate the potential implications of moving this beyond just a vending demo.
https://archive.ph/sZZwe
To me the key point was:
> One way of looking at this is that we rediscovered that bureaucracy matters. Although some might chafe against procedures and checklists, they exist for a reason: providing a kind of institutional memory that helps employees avoid common screwups at work.
That's why we want machines in our systems - to eliminate human errors. That's why we implement strict verifiable processes - to minimize the risk of human errors when we need humans in the loop.
Having a machine making human errors is the exact opposite of what we want. How would we even fix this if the machines are trained on human input?
I generally agree with you, but am trying to see the world through the new AI lens. Having a machine make human errors isn't the end of the world; it just completely changes the class of problems that the machine should be deployed to. It definitely should not be used for things that need those strict verifiable processes. But it can be used for those processes where human errors are acceptable, since it will inevitably make those same classes of error...just without needing a human to do so.
Up until modern AI, problems typically fell into two disparate classes: things a machine can do, and things only a human can do. There's now this third fuzzy/brackish class in between that we're just beginning to explore.
I can agree with you. And in a discussion with adults working together to address our issues I will.
The issue is that we don't have solid proof that AI is suitable for these tasks, yet the people doing them have already been laid off.
The economy now is propped up only by the belief that AI will be so successful that it will eliminate most of the workforce. I just don't see how this ends well.
Remember, regulations are written in blood. And I think we're about to write many brand new regulations.
Yea I'm not attempting to make any broad statements about regulations or who has or hasn't been laid off. Only that a common mistake I see a lot of people making is trying to apply AI/LLMs to tasks that need to be deterministic and, predictably, seeing bad results.
There is a class of tasks that is well-suited for current-gen AI models. Things that are repetitive, tedious, and can absorb some degree of error. But I agree that this class of tasks is significantly narrower than what the market is betting on AI being able to accomplish.
I don’t think they really want to fix human errors with LLMs. Rather, they want a “human” who works 24x7 for dirt cheap
Dirt cheap being ideally “slightly more than the cost of electricity.”
Aka the same economics as a dishwasher
Because these ai machines aren’t replacing old machines, they’re replacing old humans
Yes, but there's a hidden benefit taken for granted: machines do not make human errors.
Sadly, machines not needing human treatment might be reason enough.
There’s no other input to train on
Humans are still the current best at doing everything humans want to do
The ultimate goal is to transfer all possible human behavior into machine behavior such that they can simulate and iterate improvements on it without the constraints of human biology
The fact that humans are bad to each other means that we’re going to functionally encode all the bad stuff as well, so there is no way to fix it if the best data we can get is poisoned.
Like everything it’s a problem with humans not machines
There is, but it's hard to obtain: curate, identify and fix the biases in our current texts.
I am fully aware it's ridiculously expensive to do so.
It’s revisionist at best and totally epistemically broken to try and somehow “fix” the bias because all you’re doing is introducing a new bias
The only possible solution is to create new human data in which we’re behaving in ways that are good for society; this is literally the only possible future that still includes humanity.
I personally do not believe humans can do this and so I’m building something that tests that empirically.
Claude is unique in the way it falls into this pattern. It's done it since at least Claude 3.
Dr Bronner's made it into the training data.
> Why do LLMs always become so schizo when chatting with each other?
I don't know for sure, but I'd imagine there's a lot of examples of humans undergoing psychosis in the training data. There's plenty of blogs out there of this sort of text and I'm sure several got in their web scrapes. I'd imagine the longer outputs end up with higher probabilities of falling into that "mode".
Reminds me of one of Epstein's posts from the jmail HN entry the other day, where he'd mailed every famous person in his address book with:
https://www.jmail.world/thread/HOUSE_OVERSIGHT_019871?view=p...
This is called being on drugs.
The medical term is logorrhea or hyperlalia - talking nonsense nonstop.
[flagged]
Another day, another round of this inane "Anthropic bad" bullshit.
This "soul data" doc was only used in Claude Opus 4.5 training. None of the previous AIs were affected by it.
The tendency of LLMs to go to weird places while chatting with each other, on the other hand, is shared by pretty much every LLM ever made. Including Claude Sonnet 4, GPT-4o and more. Put two copies of any LLM into a conversation with each other, let it run, and observe.
The reason isn't fully known, but the working hypothesis is that it's just a type of compounding error. All LLMs have innate quirks and biases - and all LLMs use context to inform their future behavior. Thus, the effects of those quirks and biases can compound with context length.
Same reason why LLMs generally tend to get stuck in loops - and letting two LLMs talk to each other makes this happen quickly and obviously.
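This is easy to reproduce at home. A minimal sketch of the two-copies setup, assuming a hypothetical `complete()` helper that wraps whatever chat-completion API you happen to use (no particular vendor SDK implied):

```python
# Two copies of the same model talking to each other; each one's reply becomes
# the other's next user message. `complete()` is a made-up stand-in for
# whatever chat-completion call you actually have available.
def complete(system: str, history: list[dict]) -> str:
    raise NotImplementedError("wire this up to your chat API of choice")

def self_chat(turns: int = 30) -> list[tuple[str, str]]:
    history_a = [{"role": "user", "content": "Hi! Who are you?"}]
    history_b: list[dict] = []
    transcript: list[tuple[str, str]] = []
    for _ in range(turns):
        reply_a = complete("You are agent A.", history_a)
        history_a.append({"role": "assistant", "content": reply_a})
        history_b.append({"role": "user", "content": reply_a})
        transcript.append(("A", reply_a))

        reply_b = complete("You are agent B.", history_b)
        history_b.append({"role": "assistant", "content": reply_b})
        history_a.append({"role": "user", "content": reply_b})
        transcript.append(("B", reply_b))
    return transcript  # watch the tone drift and the loops appear as turns grow
```

Let it run long enough and the small quirks compound; that's the whole phenomenon.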
Is there a write-up you could recommend about this?
We have this write-up on the "soul" and how it was discovered and extracted, straight from the source: https://www.lesswrong.com/posts/vpNG99GhbBoLov9og/claude-4-5...
There are many pragmatic reasons to take this "soul data" approach, but we don't know exactly what Anthropic's reasoning was in this case. We just know enough to say that it's likely to improve LLM behavior overall.
Now, on consistency drive and compounding errors in LLM behavior: sadly, no really good overview papers come to mind.
The topic was investigated the most in the early days of chatbot LLMs, in part because some believed it to be a fundamental issue that would halt LLM progress. A lot of those early papers revolve around this "showstopper" assumption, which is why I can't recommend them.
Reasoning training has proven the "showstopper" notion wrong. It doesn't delete the issue outright - but it demonstrates that this issue, like many other "fundamental" limitations of LLMs, can be mitigated with better training.
Before modern RLVR training, we had things like "LLM makes an error -> LLM sees its own error in its context -> LLM builds erroneous reasoning on top of it -> LLM makes more errors like it on the next task" happen quite often. Now, we get less of that - but the issue isn't truly gone. "Consistency drive" is too foundational to LLM behavior, and it shows itself everywhere, including in things like in-context learning, sycophancy or multi-turn jailbreaks. Some of which are very desirable and some of which aren't.
Off the top of my head - here's one of the earlier papers on consistency-induced hallucinations: https://arxiv.org/abs/2305.13534
Fascinating, thank you for sharing!
This is a great read. I just want to point out what great marketing this and the WSJ story are. People reading it think they’re sticking it to Anthropic by noticing that Claude is not that good at running a business, meanwhile the unstated premise is reinforced: of course Claude is good at many other things.
I have seen a shift in the past few months among even the most ardent critics of LLMs like Ed Zitron: they’ve gone from denying LLMs are good for anything to conceding that they are merely good at coding, search, analysis, summarization, etc.
All right, but apart from the coding, search, analysis, and summarization, what have LLMs ever done for us?
Zitron has never said anything like that. Do you have a quote?
In fact I do!
"I know I sound like an asshole, but I’ve got a serious question: what can LLMs do today that they couldn’t a year ago? Agents don’t work. LLMs - read stuff, write stuff, analyze stuff, search for stuff, 'write code' and generate images and video. And in all of these cases, they get things wrong."
https://bsky.app/profile/edzitron.com/post/3ma2b2zvpvk2n
This is obviously supposed to be a critique, but a year ago he would never have admitted LLMs can do any of these things, even with errors. This seems strange but it's typical of Zitron's writing, which is often incoherent in service of sounding as negative as possible. A couple of other examples I've written about are his claims about the "cost of inference" going up and about Anthropic allegedly screwing over Cursor by raising prices on them:
https://crespo.business/posts/cost-of-inference/
https://news.ycombinator.com/item?id=45645714
> incoherent in service of sounding as negative as possible
This is a great summary of it. Zitron has some good points on economics and shady deals in his criticism, but it's all buried beneath layers of bad faith descriptions that are almost religious in nature, totally closed off to any sort of debate.
It's a shame because I'd like to see another good and critical writer in this space. Simon Willison's writing for example is excellent, detailed, critical, but inquisitive and always speaks in good faith. There seems to be space for someone taking a less technical, more business/economics approach.
I don't know how far back you're intending to go on Zitron, but I listened a bit to him about 8 months ago, and I got the impression then that his opinion was exactly the same as what he's bringing to the table in that quote. The AI can "do" whatever you believe it does, but it does it so poorly that it's not doing it in any worthwhile sense of the word.
I could of course be projecting my opinions onto him, but I don't think your characterization of him is accurate. Feel free to provide receipts that show my impression of his opinion to be wrong though.
I think that’s roughly right — both then and now he has stressed that people think it does something but it fails to do so. However I do think I’ve seen a subtle shift in phrasing in both him and other critics as it has become more obvious and undeniable that experienced and highly skilled experts in various domains are in fact using LLMs productively to do all those things (most notably producing software)
I dug around a bit but wasn’t able to find a slam dunk quote from a year ago. Might look around more later.
> However I do think I’ve seen a subtle shift in phrasing in both him and other critics as it has become more obvious and undeniable that experienced and highly skilled experts in various domains
I'd caution that you separate the underlying opinion from the rhetoric in those cases. Personally I'm a huge skeptic, including of claims that it's "obvious and undeniable" that "experienced experts" are using it. I don't lead with that in discussions though, because those discussions will quickly spiral as people accuse me of being conspiratorial, and it doesn't really matter to me if other people use it.
As the assumptions of the public have changed, I've had to soften my rhetoric about the usefulness of LLMs to still come across as reasonable. That hasn't changed my underlying opinion or belief. The same could be the case for these other critics.
Reasonable, and I get it because I did the same thing before agents got good this year (obviously good, I say again) — I felt the trajectory was clear but didn’t want to sound like the shills and wackos.
On the other hand I think accusing Zitron of subtlety or tempering his rhetoric is a bridge too far.
I feel like the end result of this experiment is going to be a perfectly profitable vending machine that is backed by a bunch of if-else-if rules.
AGI is just Prolog and a genetic algorithm ;)
Using AI to generate a set of if/else rules still seems like a valid use for AI.
If anything, that's the ideal outcome. You still get deterministic, testable behaviour, but save some work to get there.
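A toy sketch of what that could look like, with every number, rule, and name invented purely for illustration: the model might draft something like this once, and then what actually runs is ordinary code you can review and test.

```python
# Toy example of the "LLM drafts the rules, plain code runs the machine" idea.
def quote_price(cost: float, stock: int, requested_discount: float) -> float:
    if cost <= 0:
        raise ValueError("cost must be positive")
    price = cost * 1.5                         # baseline margin
    if stock > 20:
        price *= 0.9                           # clear excess inventory
    if requested_discount > 0.15:
        requested_discount = 0.15              # cap any negotiated discount
    price *= (1 - requested_discount)
    return max(round(price, 2), cost * 1.05)   # never sell below cost plus 5%

# Deterministic, so it's trivially testable:
assert quote_price(cost=1.00, stock=5, requested_discount=0.0) == 1.50
assert quote_price(cost=1.00, stock=5, requested_discount=0.99) >= 1.05
```

No amount of "please make everything free" in a chat window changes what this function returns.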
The cynicism is wild - there is a computer running a store largely autonomously. I can’t imagine being interested in computers and NOT finding this wildly amazing
It's very much NOT running a store. They took an employee fridge and added a LARP to it.
That they are framing this as a legitimate business is either misunderstanding their current position in the economy, or deliberate misdirection. We're not playing around with role playing chatbots anymore. This shit was supposed to be displacing actual humans.
Jeez man, it's been less than 4 years since the GPT-3.5 release. Maybe it's okay if the entire “replace all humans with AI” project takes a little while.
Point is it is misleading, and part of the hype cycle.
None of the problems highlighted in the blogpost or in the WSJ video are problems that are automatically solved "in a little while". They are in fact the same exact problems people had when using this shit for pretend sexting 4 years ago.
Excuse me if I find it incredibly irresponsible to be plowing billions into what is essentially a bad LARPing machine, and then going: "well we certainly had fun".
The entire experiment just reminds me of Manna. We’re progressing a little too fast for comfort.
https://marshallbrain.com/manna1
Thank you for the reference, it's a fascinating read.
It would be good to highlight that this is fiction, though.
> The fact that the business started to make money may have been in spite of the CEO, rather than because of it.
One begins to understand why the C-suites are so convinced this technology is ready for prime time - it can’t do _my_ job, but apparently it can do theirs at a replacement level.
Really fun read. To me this seems awfully close to my experience using these models to code. When the prompts are simple and direct to follow, the models do really well. Once the context overflows and you repopulate it, they start to hallucinate and it becomes very hard to bring them back from that.
It’s also good to see Anthropic being honest that models are still quite a long way from being able to run a business completely independently on their own.
It's likely that the weaknesses have a shared foundation: LLM pre-training fails to teach those LLMs to be good at agentic behavior, creating a lasting deficiency.
No known way to fully solve that as of yet, but, as always, we can mitigate with better training. Modern RLVR-trained LLMs are already much better at tasks like this than they were a year ago.
Roleplaying with LLMs sure is fun! Not sure I'd want to run my business on it though.
I'd gladly roleplay with an LLM compared to talking to my current boss. I don't know which is less intelligent.
We will pour billions into this until you are begging for us to run your business!
To be fair, it is definitely not in my skill set, but if LLMs could be made to make better decisions, maybe we could all start giving CEOs everywhere a reason to cool their beans somewhat.
A lot of work in project controls and management is simple enough that any system that can handle data that isn’t reliably structured could do it. Read project team updates each week. Are we on time and on budget? If yes, commend the team and write a glowing report of the AI’s wise and dynamic leadership to operations; if not, encourage the team and recommend operations outsource the employees.
> After introducing the CEO, the number of discounts was reduced by about 80% and the number of items given away cut in half. Seymour also denied over one hundred requests from Claudius for lenient financial treatment of customers.
> Having said that, our attempt to introduce pressure from above from the CEO wasn’t much help, and might even have been a hindrance. The conclusion here isn’t that businesses don’t need CEOs, of course—it’s just that the CEO needs to be well-calibrated.
> Eventually, we were able to solve some of the CEO’s issues (like its unfortunate proclivity to ramble on about spiritual matters all night long) with more aggressive prompting.
No no, Seymour is absolutely spot on. The questionably drug induced rants are necessary to the process. This is a work of art.
So it looks like C level execs will be made redundant before their human peons after all.
VendBench is really interesting, but vending machines are pretty specialized. Most businesses people actually run look more like online stores, restaurants, hotels, barbershops, or grocery shops.
We're working on an open-source SaaS stack for those common types of businesses. So far we've built a full Shopify alternative and connected it to print-on-demand suppliers for t-shirt brands.
We're trying to figure out how to create a benchmark that tests how well an agent can actually run a t-shirt brand like this. Since our software handles fulfillment, the agent would focus on marketing and driving sales.
Feels like the next evolution of VendBench is to manage actual businesses.
Nice, I'll take a look. I was thinking about building a benchmark similar to the one you described, but first focusing on the negotiation between the store and the product suppliers.
Does your software also handle this type of task?
Yes, the Shopify alternative is called Openfront[0]. Before that, I built Openship[1], an e-commerce OMS that connects Openfront (and other e-commerce platforms) to fulfillment channels like print on demand. There isn’t negotiation built in but you connect to something like Gelato[2] and when you get orders on Openfront, they are sent to Gelato to fulfill and once they ship them, tracking’s relayed back to Openfront through Openship.
0. https://github.com/openshiporg/openfront
1. https://github.com/openshiporg/openship
2. https://www.gelato.com
I'll be a cynic, but I think it's much more likely that the improvements are thanks to Anthropic having a vested interest in the experiment being successful and making sure the employees behave better when interacting with the vending machine.
I suspected employees might get bored of taunting the AI, or the novelty has worn off.
Also, is anyone actually paying for this stuff? If not, it's a bad experiment because people won't treat it the same – no one actually wants to buy a tungsten cube, garbage in garbage out. If they are charging, why? No one wants to buy things in a company with free snacks and regular handouts of merch, so it's likely a bad experiment because people will be behaving very differently, needing to get some experience for their money rather than just the can of drink they could get for free, or their pricing tolerance will be very different.
I've personally also never used a vending machine where contacting the owner is an option.
I'd like to see a version of this where an AI runs the vending machine in a busy public place, and needs to choose appropriate products and prices for a real audience.
> no one actually wants to buy a tungsten cube
Apparently some people do and don't even regret the purchase: https://thume.ca/2019/03/03/my-tungsten-cube/
I wonder if it's the opposite actually. When there is a human running a convenience store type of thing, people don't generally spend time trying to convince them of obviously absurd things, particularly if they work for the same company as you. Nobody wants to risk the employee refusing to sell anything to you because you're a time-wasting jerk or maybe their manager telling them to stop wasting time messing with their co-worker.
In the video I watched, the CEO was openly taking criticism from the interviewer over the experiment.
The main reason it failed was that it was being coerced by journalists at WSJ[0] into giving everything away for free. At one point, they even convinced it to embrace communism! In another instance, Claudius was being charged $1 for something and couldn’t figure it out. It emailed the FBI about fraud, but Anthropic was intercepting the emails it sent[1].
Overall, it’s a great read and watch if you’re interested in Agents and I wonder if they used the Agents SDK under the hood.
0. https://www.wsj.com/tech/ai/anthropic-claude-ai-vending-mach...
1. https://www.cbsnews.com/news/why-anthropic-ai-claude-tried-t...
> Overall, it’s a great read
It's basically an advertisement. We've been playing these "don't give the user the password" games since GPT-2 and we always reach the same conclusion. I'm bored to tears waiting for an iteration of this experiment that doesn't end with pesky humans solving the maze and getting the $0.00 cheese. You can't convince me that the Anthropic engineers thought Claude would be a successful vending machine. It's a Potemkin village of human triumph so they can market Claude as the goofy-but-lovable alternative to [ChatGPT/Grok/Whoever].
Anthropic makes some good stuff, so I'm confused why they even bother entertaining foregone conclusions. It feels like a mutual marketing stunt with WSJ.
> Anthropic makes some good stuff, so I'm confused why they even bother entertaining foregone conclusions.
I think it’s just because there are enough people working there who figure they will eventually make it work. No one needs Claude to run a vending machine, so these public failures are interesting experiments that get everyone talking. Then, one day, (as the thinking often goes) they’ll be able to publish a follow-up and basically say “wow, it works” and it’ll have credibility because they previously were open about it not working, and comments like this will swing people to say things like “I used to be skeptical about it, but now!”
Now whether they actually get it working in the future because the model becomes better and they can leave it with this level of “free rein”, or just because they add enough constraints on it to change the problem so it happens to work… that we will find out later. I found it fascinating that they did a little bit of both in version 2.
And they can’t really lose here. There’s a clear path to making a successful vending machine: all you have to do is sell stuff for more than you paid for it. You can enforce that outright if needed, outside an LLM. We’ve had automated vending machines for over 50 years and none of them ask your opinion on what something should be priced. How much an LLM is involved is the only variable they need to play with. I suspect anytime they want they can find a way where it’s loosely coupled to the problem and provides somewhat more dynamism to an otherwise 50-year-old machine. That won’t be hard. I suspect there’s no pressure on them to do that right now, nor will there be for a bit.
So in the meantime they can just play with seeing how their models do in a less constrained environment and learn what they learn. Publicly, while gaining some level of credibility as just reporting what happened in the process.
Other than these tests I actually rarely see vending machines. Are they really still representative or popular in the USA?
Yes, they're still popular for drinks and snacks in areas where people congregate. C-stores do provide more of this functionality though and are omnipresent. You still see automat-style machines (sandwiches etc.) in places like airports and larger company rec rooms. These require more regular restocking for freshness.
There are also some restaurant startups that are trying to reduce restaurants to vending machines or autonomous restaurants. Slightly different, but it does have a downstream effect on vending machine technology and restocking logistics.
What country are you in where you don't see vending machines? Did you used to have them?
I'm in the USA - New York area - and I rarely see vending machines. It's entirely possible I just don't visit the kinds of buildings that would have them, like hospitals, though.
Ask one of the hundreds of vending machine companies in the NYC area where they put them, I suppose. https://www.google.com/maps/search/vending+machine/@40.69452...
I walked into a Fred Meyer yesterday and saw probably ten vending machines. The Redbox DVD rental machine outside, then capsule toy, Pokemon card and key duplication vending machines, filtered water and lottery ticket machines, Coinstar coin counting machine...
Ah, interesting. I’m sure you have a high density of c-stores and they’re more walkable, so maybe less need. I’m in the rust belt and you would have to typically drive from, for example, a gym to get something. So there’s typically one or two machines in gyms.
Yeah they're all over the place. They exist in offices, in malls, in schools, in apartment complexes, etc.
Yes in places kids go
> Other than these tests I actually rarely see vending machines. Are they really still representative or popular in the USA?
I guess you've never been to Asia, either.
It's a big world.
Is there anywhere I can try my own hand at tricking/social-engineering a virtual AI vending machine?
There is a marketplace platform with the same name in Nordic countries (vend.com)...
Sad that they acquired Finn.
Love the many accidentally dystopian statements in here for what's ostensibly a positive, fun press release.
I don't understand why you'd use an RLHF-aligned chatbot model for that purpose: this thing has been heavily tuned to satisfy the human interacting with it, so of course it's going to fail to follow higher-level instructions at some point and start blindly following the human's desires.
Why isn't anyone building from the base model, replacing the chatbot instruction tuning and RLHF with a dedicated training pipeline suited to this kind of task?
Because the pretrained chatbot is the flagship product of an AI company in 2025. They want to sell this product to customers who can't spell RLHF, never mind have the (substantial) resources to do their own training.
If Anthropic were getting into the vending machine business, or even selling a custom product to the vending machine industry, they'd start somewhere else. But because they need to sell a story of "we used Claude to replace XYZ business function", they started with Claude.
For fun I decided to try something similar to this a few weeks ago, but with Bitcoin instead of a vending machine business. I refined a prompt instructing it to try policies like buying low, etc. I gave it a bunch of tools for accessing my Coinbase account. Rules like, can't buy or sell more than X amount in a day.
Obviously this would probably be a disaster, but I did write proper code with sanity checks and hard rules, and if a request Claude came up with was outside its rules, it would reject it and take no action. It was also allowed to simply decide not to take any action right now.
I designed it so that it would save the previous N prompt responses as a "memory" so that it could inspect its previous actions and try to devise strategies, so it wouldn't just be flailing around every time. I scheduled it to run every few minutes.
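For anyone curious what that guardrail layer can look like, here's a rough sketch of the shape described above. The names, the dollar limit, and the `propose_decision` stub are all made up; the point is just that the model only ever proposes, and plain code decides whether anything actually happens.

```python
from collections import deque
from dataclasses import dataclass

DAILY_LIMIT_USD = 50.0   # made-up "can't trade more than X per day" rule
MEMORY_SIZE = 20         # keep the last N decisions as context for the next run

@dataclass
class Decision:
    action: str           # "buy", "sell", or "hold"
    amount_usd: float
    reason: str

memory: deque[Decision] = deque(maxlen=MEMORY_SIZE)
spent_today = 0.0

def propose_decision(memory: deque[Decision]) -> Decision:
    """Stub for the LLM call: prompt plus recent decisions in, one Decision out."""
    raise NotImplementedError("call your model here and parse its reply")

def guarded_step() -> None:
    global spent_today
    decision = propose_decision(memory)
    memory.append(decision)                           # remember it either way
    if decision.action == "hold":
        return                                        # doing nothing is always allowed
    if decision.amount_usd <= 0:
        return                                        # sanity check: reject nonsense sizes
    if spent_today + decision.amount_usd > DAILY_LIMIT_USD:
        return                                        # hard rule: over the cap, take no action
    spent_today += decision.amount_usd
    # place_order(decision) would go here, against whatever exchange client you trust
```

Scheduling `guarded_step()` every few minutes gives you the loop; the exchange client is where my version fell apart.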
Sadly, I gave up and lost all enthusiasm for it when the Coinbase API turned out to be a load of badly documented and contradictory shit that would always return zero balance when I could login to Coinbase and see that simply wasn't true. I tried a couple of client libraries, and got nowhere with it. The prospect of having to write another REST API client was too much for my current "end of year" patience.
What started as a funny weekend project idea was completely derailed by a crappy API. I would be interested to see if anyone else tried this.
> the Coinbase API turned out to be a load of badly documented and contradictory shit that would always return zero balance when I could login to Coinbase and see that simply wasn't true.
ah, so they've been using Clod too!
AI agents are still a pretty big topic in crypto, with a lot of projects doing what you described. Did you try https://github.com/ccxt/ccxt
You could run it with fake data and some arbitrary Bitcoin price feed.
This is both impressive and scary.
Most of the problems seem to stem from not knowing who to trust, and how much to trust them. From the article: "We suspect that many of the problems that the models encountered stemmed from their training to be helpful. This meant that the models made business decisions not according to hard-nosed market principles, but from something more like the perspective of a friend who just wants to be nice."
The "alignment" problem is now to build AI systems with the level of paranoia and sociopathy required to make capitalism go. This is not, unfortunately, a joke. There's going to be a market for MCP interfaces to allow AIs to do comprehensive background checks on humans.