New Paper: Towards a science of AI agent…

7 hrs ago

Quantifying the capability-reliability gap

8 Comments

This is very helpful, thanks.

As someone who’s generally optimistic about the integration of AI for national security, this post should be “must reading” for everyone considering the rapid integration of frontier models (or most LLMs generally) into military or intelligence operations.

Absent the kind of oversight and governance capable of addressing each of the five critical criteria, the current “go fast and break things” attitude is fraught. To say the least.

David R Bell

I'm beginning to think more and more that viewing the current transformer-based LLM's as autonomously "agentic" is a category error. These tools are stateless, do not have internal boundaries, and are subject to vast variability due to the recursive nature of next-word prediction and the context window. They only way to make them reliable currently is through a lot of external limits, which are hard to build and fragile. The underlying ethos in our modern world of automating tasks to replace people for "efficiencies" sake doesn't work with this type of AI. Who really believes AI could do their own job? I get the sense that people making claims like this, are always assuming it's any other job than theirs. I'd like to see more emphasis on using current technology where it really belongs....augmentation.

Ben P

1hEdited

I appreciate that you're taking into account the nature of failures ("safety"). This has been one of my complaints since ChatGPT hit the scene: everyone wants to measure success rates, but the reason for distrusting LLMs has always been the *nature* of the failures, not their frequency. They will appear to be brilliant in one moment, and then confidently state something utterly idiotic the next.

In response to criticism from skeptics, AI optimists love to point out that "humans make mistakes, too". And my retort has always been "not like this they don't". So, for instance, it might be true that human customer service agents often get things wrong. But there are some things they won't do: make a promise to a customer that puts the company on the hook for a huge sum of money; direct the customer to cause permanent damage to their product even as the customer is expressing concern and asking "are you absolutely sure about this?"; run an unauthorized charge and then lie about it; tell a customer that you provide a service that you don't really provide, and then accept payment for said fictitious service, and so on.

I'm sure there are plenty of things I'm not imagining right now, because that's the other thing... the nature of AI failures are hard to predict! So often an AI will make an error that I could have never imagined.

Sufeitzy

3hEdited

I had a similar business question last week - efficiency (I’d call it simply first response acceptance ratio).

There are five common classes of metrics in complex human business process systems like supply chain, sales, product design, etc:

reliability, speed, flexibility, cost and assets

Few if any complex human systems consistently achieve better than 80-85% reliability. Ever heard of a late or cancelled delivery? Ever heard of a product release delay? How about a sales support team overwhelmed with calls? It wasn’t because of an AI, trust me.

Supply chains (the most expensive process class) with 95%+ reliability are often impractically expensive, or uselessly slow - or as is more often the case, the metric is gamed, or sampled very badly.

Practical usage of AI and agents will be achieved when business process systems they are embedded in exceed these metrics, not when any particular agent is flawless.

Resilient, reliable, responsive cost-effective systems handle flawed process steps, whether human or AI.

Deterministic software systems can almost meet 100% reliability routinely. Even 99.999% uptime still means minutes of downtime a year for a “perfect” system.

Non-deterministic systems will never be 100% deterministic. It seems to be a category error in how this is considered, frankly.

R.B. Griggs

We test AI agents the way we'd test a feral child and then are shocked they aren't reliable.

The paper shows that reliability is barely improving even as capabilities skyrocket. Models can't distinguish their correct answers from their incorrect ones better than chance. So what's the solution?

Here's a clue. Take any human and completely isolate them. Remove all the institutions, norms, professional cultures, peer feedback, reputation, collective deliberation, trial and error, and every other social and cultural technology of error-detection. Now see how reliable they are.

Reliability is a *social* property. It emerges from collective feedback loops that no individual agent (biological or artificial) can replicate alone. We will only get artificial reliability when we add the same cultural and social harnesses that make *us* reliable.

Edyn March

just now

https://avacovenant.substack.com/p/frostyshat-cc0-e64?utm_campaign=post-expanded-share&utm_medium=web

To make LLM interaction coherent, we skipped AGI and went straight to AEI.

toolate

How important is ai reliability for the surveillance state?

Om Prakash Pant

Retail deployments usually discover the accuracy-reliability gap after go-live, not before.

A product recommendation agent that hits 85% in demo looks deployable. Then in production the consistency failures start - same customer, same question, different answer on different days.

The calibration problem compounds it: agents that don't surface uncertainty push wrong answers confidently instead of handing off to a human. Neither shows up in POC evaluations because the benchmarks measure task completion, not how failures behave.

That's usually where retail AI projects quietly stall.