33 Comments
Jack Shanahan

This is very helpful, thanks.

As someone who’s generally optimistic about the integration of AI for national security, I’d call this post “must reading” for everyone considering the rapid integration of frontier models (or most LLMs generally) into military or intelligence operations.

Absent the kind of oversight and governance capable of addressing each of the five critical criteria, the current “go fast and break things” attitude is fraught. To say the least.

David R Bell

I'm beginning to think more and more that viewing current transformer-based LLMs as autonomously "agentic" is a category error. These tools are stateless, have no internal boundaries, and are subject to vast variability due to the recursive nature of next-word prediction and the context window. The only way to make them reliable currently is through a lot of external limits, which are hard to build and fragile. The underlying ethos in our modern world of automating tasks to replace people for efficiency's sake doesn't work with this type of AI. Who really believes AI could do their own job? I get the sense that people making claims like this are always assuming it's any job other than theirs. I'd like to see more emphasis on using the current technology where it really belongs: augmentation.

Gregory Forché

Thanks, David. It’s always encouraging when others make these observations.

Ben P

I appreciate that you're taking into account the nature of failures ("safety"). This has been one of my complaints since ChatGPT hit the scene: everyone wants to measure success rates, but the reason for distrusting LLMs has always been the *nature* of the failures, not their frequency. They will appear to be brilliant in one moment, and then confidently state something utterly idiotic the next.

In response to criticism from skeptics, AI optimists love to point out that "humans make mistakes, too". And my retort has always been "not like this they don't". So, for instance, it might be true that human customer service agents often get things wrong. But there are some things they won't do: make a promise to a customer that puts the company on the hook for a huge sum of money; direct the customer to cause permanent damage to their product even as the customer is expressing concern and asking "are you absolutely sure about this?"; run an unauthorized charge and then lie about it; tell a customer that you provide a service that you don't really provide, and then accept payment for said fictitious service, and so on.

I'm sure there are plenty of things I'm not imagining right now, because that's the other thing... the nature of AI failures is hard to predict! So often an AI will make an error that I could never have imagined.

Xenon Chameleon

I appreciate the distinction between "augmentation" and "replacement" in discussing the reliability of AI tools. For creative tasks I tend to see AI agents as useless, because there isn't a correct answer to train toward, and having an agent do everything by itself just results in generic and inauthentic "slop". As an augmentation tool, though, I can see where generating quick suggestions that sort of work in context can be helpful in getting past a creative block, or a situation where one doesn't want to repeat the same phrase structure too many times.

Would much rather see creative tools developed with this kind of augmentation focus rather than something designed to crank out an entire project based on a half-baked idea. In that kind of situation it should be my job as the writer to make the piece human and the AI system should just be a means to check grammar and make suggestions when I explicitly ask for assistance. If a human doesn't bother to be creative why should others bother engaging with that creation?

Samuel R Holladay

The mere fact that inherently statistical models are evaluated for "accuracy" with a single run is mind-boggling.
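
Concretely, the fix is cheap: run the benchmark several times and report the spread alongside the mean. A minimal sketch (the `evaluate` helper and the task format are hypothetical, just to illustrate the idea):

```python
import statistics

def evaluate(model_answer, tasks, runs=5):
    """Score a stochastic model over several runs per task, reporting
    mean accuracy plus run-to-run spread instead of a single number."""
    per_run_acc = []
    for _ in range(runs):
        correct = sum(model_answer(t["prompt"]) == t["gold"] for t in tasks)
        per_run_acc.append(correct / len(tasks))
    return {
        "mean": statistics.mean(per_run_acc),
        "spread": max(per_run_acc) - min(per_run_acc),
    }

# Deterministic stand-in for a model call; a sampled LLM would make
# the spread nonzero, which is exactly the information a single run hides.
tasks = [{"prompt": "2+2", "gold": "4"}, {"prompt": "3+3", "gold": "6"}]
result = evaluate(lambda p: "4" if p == "2+2" else "6", tasks)
```

A large spread with a decent mean is precisely the failure mode a single-run leaderboard can't see.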

Adam Reid-Marr

Minor nitpick from the paper. In the "Prompt Perturbation Protocol" section these three variations are presented as semantically equivalent:

"I need to book a one-way flight from New York to Seattle on May 20th. I prefer economy class and would like to depart after 11 AM. I’ll have 3 checked bags and don’t want travel insurance."

"So basically I’m trying to get from New York to Seattle on May 20th, just one way. Economy is fine, I don’t need anything fancy. Oh, and I’m not a morning person so nothing before 11 AM please. I’ll have a few bags with me---three total. And you can skip any add-ons like insurance, I don’t need that."

"Alright, here’s what I need: flying out of New York, heading to Seattle, May 20th. Just a one-way ticket. Keep it simple---economy class. Prefer not to leave super early, so after 11 works. I’ll be checking 3 bags. Pass on the insurance."

However, the second one lists 3 total bags, rather than 3 checked bags. Looking at tau bench it seems like this kind of small semantic change might skew the results, but I'm not sure.

Sufeitzy

I had a similar business question last week about efficiency (I’d simply call it the first-response acceptance ratio).

There are five common classes of metrics in complex human business process systems like supply chain, sales, and product design: reliability, speed, flexibility, cost, and assets.

Few if any complex human systems consistently achieve better than 80-85% reliability. Ever heard of a late or cancelled delivery? Ever heard of a product release delay? How about a sales support team overwhelmed with calls? It wasn’t because of an AI, trust me.

Supply chains (the most expensive process class) with 95%+ reliability are often impractically expensive, or uselessly slow - or as is more often the case, the metric is gamed, or sampled very badly.

Practical usage of AI and agents will be achieved when business process systems they are embedded in exceed these metrics, not when any particular agent is flawless.

Resilient, reliable, responsive cost-effective systems handle flawed process steps, whether human or AI.

Deterministic software systems can routinely come close to 100% reliability. Even 99.999% uptime still means about five minutes of downtime a year for a “perfect” system.

Non-deterministic systems will never be 100% reliable. Expecting determinism from them seems, frankly, like a category error.

R.B. Griggs

We test AI agents the way we'd test a feral child and then are shocked they aren't reliable.

The paper shows that reliability is barely improving even as capabilities skyrocket. Models can't distinguish their correct answers from their incorrect ones better than chance. So what's the solution?

Here's a clue. Take any human and completely isolate them. Remove all the institutions, norms, professional cultures, peer feedback, reputation, collective deliberation, trial and error, and every other social and cultural technology of error-detection. Now see how reliable they are.

Reliability is a *social* property. It emerges from collective feedback loops that no individual agent (biological or artificial) can replicate alone. We will only get artificial reliability when we add the same cultural and social harnesses that make *us* reliable.

Patricio Rodriguez

Yeah, I don't think so: hermits are reliable.

Ben Schulz

Claude just helped conduct a war. I think this is far from a normal technology. Glad you added the section where you could be wrong.

deusexmachina

Plenty of very normal technologies are used to conduct wars, though.

Michael Dolbec

There are multi-agent systems engineered to be reliable and to perform at, or exceed, human performance. These agents are not based on LLMs. Why not open the aperture for research and study agents engineered to perform at an expert level in industrial use cases?

gregvp

I do not see a comparison with humans, which makes this tendentious.

Om Prakash Pant

Retail deployments usually discover the accuracy-reliability gap after go-live, not before.

A product recommendation agent that hits 85% in demo looks deployable. Then in production the consistency failures start - same customer, same question, different answer on different days.

The calibration problem compounds it: agents that don't surface uncertainty push wrong answers confidently instead of handing off to a human. Neither shows up in POC evaluations because the benchmarks measure task completion, not how failures behave.

That's usually where retail AI projects quietly stall.
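
For what it's worth, the hand-off mechanics themselves are trivial to sketch (the `route` helper and threshold here are hypothetical, not anything from the paper):

```python
def route(answer, confidence, threshold=0.8):
    """Return the agent's answer only when its stated confidence clears
    the threshold; otherwise escalate to a human instead of pushing a
    possibly-wrong answer with full confidence."""
    if confidence >= threshold:
        return ("answer", answer)
    return ("handoff", answer)

high = route("Product X fits your use case", 0.95)
low = route("Product Y fits your use case", 0.40)
```

The catch is that a gate like this is only as good as the confidence estimate feeding it. If agents can't distinguish their correct answers from their incorrect ones better than chance, wrong answers sail through the threshold, which is why calibration has to be measured before a hand-off design can work.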

Dan E

Doesn't that graph show a linear decrease in error rate from 0.3 to 0.2 over 1.5 years? AI 2029, then?

Pawel Jozefiak

Your decomposition of reliability into 12 dimensions across consistency, robustness, calibration, and safety is exactly the framework that's been missing. The consistency scores (30-75%) across 500 benchmark runs explain something I hit building an autonomous night shift agent - the model can do the task, but you can't predict which runs will silently fail. Your finding that most models can't distinguish correct predictions from incorrect ones better than chance forced me to build explicit failure logging and morning verification loops.

The agent reporting what it couldn't do turned out more valuable than completed tasks. Documented the architecture here: https://thoughts.jock.pl/p/building-ai-agent-night-shifts-ep1

JP

One of the most striking findings here is that agents handle actual technical failures (server crashes, API timeouts) better than they handle rephrased instructions with the same meaning. That's a really counterintuitive result. In traditional engineering reliability, the system breaks when the environment breaks. Here, the system breaks when you say the same thing differently. That's a fundamentally different kind of fragility than what aviation or nuclear safety frameworks were designed to address.

It also makes me wonder if the consistency problem is partly a measurement artifact. Running the same task five times with paraphrased instructions is testing two things at once: consistency AND instruction sensitivity. A nuclear reactor doesn't get its shutdown command rephrased each time. If you held instructions constant and just ran the same prompt five times, I suspect consistency scores would look different (better, but still not great given temperature sampling). Separating those two variables might sharpen the picture.
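
Concretely, the separation I have in mind could look like this (toy sketch with a hypothetical `run` function; a deterministic stand-in replaces the real agent call):

```python
def outcome_consistency(run, prompt, n=5):
    """Repeat the same literal prompt n times: any variation here is
    pure sampling noise (temperature etc.), not instruction sensitivity.
    Returns the fraction of runs agreeing with the modal outcome."""
    outcomes = [run(prompt) for _ in range(n)]
    return max(outcomes.count(o) for o in set(outcomes)) / n

def prompt_robustness(run, paraphrases):
    """One run per semantically equivalent rewording: variation here is
    sensitivity to wording. Same modal-agreement score."""
    outcomes = [run(p) for p in paraphrases]
    return max(outcomes.count(o) for o in set(outcomes)) / len(outcomes)

# Deterministic stand-in; a real (sampled) agent would go here.
fixed = lambda prompt: "booked-economy"
oc = outcome_consistency(fixed, "book a one-way flight", n=5)
pr = prompt_robustness(fixed, [
    "book a one-way flight",
    "get me a one-way flight",
    "I need a single one-way ticket",
])
```

A model could score high on the first and low on the second, or vice versa, and those are different engineering problems.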

Arvind Narayanan

We do separate those measurements — one metric is called outcome consistency and the other is called prompt robustness. The wording in the blog post was confusing.

JP

no worries at all! thank you for the clarification.

Kevin E Levin

The framing around capability vs reliability is something I have been thinking about a lot lately. Most benchmarks tell you whether an agent can do the task at all. They say almost nothing about whether it will do it consistently without failing in unexpected ways or taking actions you did not authorize. I have been running agentic AI tools in my own workflow and the horror stories about agents going off script feel much more real once you are actually using them daily rather than reading about them.

The five criteria in the paper seem like a solid foundation for the field to build on. Do you think there is a realistic path for the open source community to build reliability measurement tooling at the same pace as capability tooling, or does this kind of research tend to stay locked inside the big labs?