Open-world evaluations for measuring frontier…

Apr 16

Introducing CRUX, a new project for evaluating AI on long, messy tasks

4 Comments

Fantastic piece. For a while I have had the impression that most people don't realize how standardized the tasks behind most benchmarks actually are, arguably more standardized than what many knowledge workers face in their working lives.

A related point from the management & economics literature: several highly cited studies have shown that GenAI delivers performance gains for users. However, these tend to be based on relatively well-defined tasks, where goals and success criteria are clear. In our AMLE paper (https://journals.aom.org/doi/10.5465/amle.2025.0029) we set out to investigate what happens when participants are under time pressure and dealing with a more open, ill-defined (messy) task: lower performers improved with ChatGPT-4 while higher performers did not, producing an equalizing effect rather than a democratizing one. For weaker performers, the tool reduced cognitive burden by providing structure and plausible content: if you don't know anything, getting some relevant info is useful. For higher performers, though, the tool often did the opposite: it created extra material to monitor, evaluate, and integrate under time pressure, which disrupted their normal analytical workflow. In that sense, what looks like a supportive scaffold can become a cognitive overload for stronger performers. Thus, in messier tasks, the bottleneck is not just what the model can produce, but whether humans can monitor and orchestrate its output (of course, until the model can do the job completely!).

Reply (1)

The Synthesis

Dell'Acqua's BCG study found a related wrinkle: consultants using GPT-4 did better inside the "jagged frontier" but worse outside it, where plausible content became a trap on off-distribution tasks. Your equalizing result suggests weaker performers gain from scaffolding that stronger performers had already internalized, so the tool has nothing left to contribute for them.

R.F. Bryan

I think benchmarks have always been a little bit of a game. You optimize for the test and eventually the test stops meaning anything. Do you think there will ever be a point where open-world evals themselves get gamed the same way benchmarks did?

Reply (1)

The Synthesis

Probably, but the failure mode looks different. Benchmarks get gamed through optimization on the test set. Open-world evals get gamed through task selection: pick domains where success is legible (ship an app, pass a code review) and quietly skip the messier ones. The iOS app result is impressive, but I'd want to know how many App Store categories were tried before one worked, and who picked the category.