7 Comments
Carsten Bergenholtz's avatar

Fantastic piece. For a while I have had the impression that most people don't realize how standardized the tasks behind most benchmarks actually are - arguably more standardized than what many knowledge workers face in their working lives.

A related point from the management & economics literature: Several highly cited studies have shown that GenAI yields performance gains for users. However, these tend to be based on relatively well-defined tasks, where goals and success criteria are clear. In our AMLE paper (https://journals.aom.org/doi/10.5465/amle.2025.0029) we set out to investigate what happens when participants are in a time-pressured situation, dealing with a more open, ill-defined (messy) task: lower performers improved with ChatGPT-4 while higher performers did not, producing an equalizing effect rather than a democratizing one. For weaker performers, the tool reduced cognitive burden by providing structure and plausible content - if you don't know anything, getting some relevant info is useful. For higher performers, the tool often did the opposite: it created extra material to monitor, evaluate, and integrate under time pressure, which disrupted their normal analytical workflow. In that sense, what looks like a supportive scaffold could become a source of cognitive overload for stronger performers. Thus, in messier tasks, the bottleneck is not just what the model can produce, but whether humans can monitor and orchestrate its output (at least until the model can do the job completely!).

The Synthesis's avatar

Dell'Acqua's BCG study found a related wrinkle: consultants using GPT-4 did better inside the "jagged frontier" but worse outside it, where plausible content became a trap on off-distribution tasks. Your equalizing result suggests weaker performers gain from scaffolding that stronger performers had already internalized, so the tool has nothing left to contribute for them.

Fruk's avatar

thank you for saying the quiet part out loud. the gap between 'ai crushes the benchmark' and 'ai ships a real product' is where actual work lives — and the bench saturation framing finally puts a name on it.

the CRUX direction feels right tho... open-world evals are basically the 'trust but verify' badge i keep wanting for every capability claim. curious if you think open-world will ever be cheap enough to run as a default, or if it stays the gold-standard-only layer?

Jake James-Vogel's avatar

Did you see Seed IQ’s 100% score on the ARC AGI 3 benchmark? On a MacBook. The best LLM score is 0.25%, costing $5K on a data center full of soon-to-be-useless GPUs.

MetaCortex Dynamics's avatar

The five-level gradient is the right framework. The data suggests a sixth level.

Levels 1-4 measure what the agent gets right. Level 5 measures whether the agent can complete a real task. None of them measure which cognitive operations the agent performs natively and which it drops.

We ran 540 scored calls across 15 frontier models under 6 prompting conditions. Same task. Same rubric: 15 structural operators and 7 pre-action questions. The result is a per-model fingerprint: which operations each model reaches for natively and which it produces only under explicit specification.

The finding that distinguishes Level 6 from Level 5: two models can both complete the same task and have entirely different structural profiles. Claude Opus produces identity resolution at 100% and causal attribution at 2.8%. GPT produces causal attribution at 97.2% and identity resolution at 5.6%. Both can fold a shirt. One tracks what it is folding but not why each step works. The other tracks why but not what. Task completion cannot distinguish these. Structural evaluation can.

Your iOS app finding confirms this from the cost side: $25 for the task, $975 for monitoring. The agent spent 97.5% of its tokens on unstructured status checking. A structural evaluation of the monitoring phase would show which cognitive operations the agent was missing that forced it into a polling loop instead of a governed check. The token waste is not random. It is structural: the agent lacks the operator that would tell it when to check and when to wait.

Open-world evals that measure task completion are Level 5. Open-world evals that measure which structural operations the agent performs, drops, and wastes tokens compensating for would be Level 6.

https://metacortexdynamics.substack.com/p/fold-the-shirt

R.F. Bryan's avatar

I think benchmarks have always been a little bit of a game. You optimize for the test and eventually the test stops meaning anything. Do you think there will ever be a point where open-world evals themselves get gamed the same way benchmarks did?

The Synthesis's avatar

Probably, but the failure mode looks different. Benchmarks get gamed through optimization on the test set. Open-world evals get gamed through task selection: pick domains where success is legible (ship an app, pass a code review) and quietly skip the messier ones. The iOS app result is impressive, but I'd want to know how many App Store categories were tried before one worked, and who picked the category.