<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI as Normal Technology]]></title><description><![CDATA[Analyzing AI as transformative but normal technology, not superintelligence.]]></description><link>https://www.normaltech.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!YzQY!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6780dda8-5879-4789-bf02-b3ea43a9f85e_1000x1000.png</url><title>AI as Normal Technology</title><link>https://www.normaltech.ai</link></image><generator>Substack</generator><lastBuildDate>Fri, 24 Apr 2026 11:24:42 GMT</lastBuildDate><atom:link href="https://www.normaltech.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Sayash Kapoor and Arvind Narayanan]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[normaltech@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[normaltech@substack.com]]></itunes:email><itunes:name><![CDATA[Sayash Kapoor]]></itunes:name></itunes:owner><itunes:author><![CDATA[Sayash Kapoor]]></itunes:author><googleplay:owner><![CDATA[normaltech@substack.com]]></googleplay:owner><googleplay:email><![CDATA[normaltech@substack.com]]></googleplay:email><googleplay:author><![CDATA[Sayash Kapoor]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Open-world evaluations for measuring frontier AI capabilities]]></title><description><![CDATA[Introducing CRUX, a new project for evaluating AI on long, messy tasks]]></description><link>https://www.normaltech.ai/p/open-world-evaluations-for-measuring</link><guid isPermaLink="false">https://www.normaltech.ai/p/open-world-evaluations-for-measuring</guid><dc:creator><![CDATA[Sayash Kapoor]]></dc:creator><pubDate>Thu, 16 Apr 2026 17:47:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/44ed558c-fa00-46fc-a80d-c8a70b338fc9_2992x1616.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is 8,000 words long&#8212;it is our new collaborative paper on an emerging type of AI evaluation. The paper is also published in a PDF format <a href="https://cruxevals.com/open-world-evaluations.pdf">here</a>.</em></p><p><em><strong>Summary</strong>: AI models have started to saturate most major benchmarks. But does that mean they can build and ship a real product, or conduct a scientific experiment end-to-end, or navigate a government bureaucracy? Researchers have started testing AI in such real-world settings. We call these evals &#8220;open-world evaluations&#8221;. This essay defines open-world evaluations, <a href="https://cruxevals.com/#survey">surveys</a> the lessons learned so far, and lays out best practices for conducting them.</em></p><p><em>We also introduce CRUX, a collaboration of 17 researchers from academia, government, civil society, and industry that will regularly evaluate frontier AI capabilities through open-world evaluations. In our first experiment, an AI agent built and <a href="https://apps.apple.com/in/app/breathe-easy-calm-breaths/id6760207382">published</a> an iOS app to the App Store, making just two errors, one of which required manual intervention. 
This gives us an early indication of potentially useful capabilities and, more importantly, an early warning about the potential for AI-driven app store spam (we disclosed this result to Apple a month before publication). </em></p><p><em>We hope to conduct similar experiments to surface early warnings across other real-world domains; this will be one of our main empirical projects over the coming year.</em></p><p><em><strong>The authors are:</strong> Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J.J. Allaire, Rishi Bommasani, Magda Dubois, Gillian Hadfield, Andy Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan</em></p><div><hr></div><p>How should we track and predict AI capabilities? The AI community&#8217;s dominant answer today is benchmarking. For example, <a href="https://metr.org/time-horizons/">METR&#8217;s time horizon graph</a> has been used by policy analysts, industry leaders, and organizations researching AI risks to argue that AI capabilities are rapidly increasing.</p><p>But benchmarks can both overestimate and underestimate progress. To turn a task into a benchmark, it needs to be precisely specified and automatically verifiable. The catch is that whatever is precise enough to benchmark is also precise enough to optimize for, allowing AI agents to excel at such tasks. On the flip side, low accuracy on benchmarks might result from incidental failures such as encountering CAPTCHA on a website, even if agents are capable of solving the underlying task.</p><p>To address these limitations, many researchers are turning to a new kind of evaluation: long, messy, real-world evaluations that go beyond benchmarks. Nicholas Carlini at Anthropic used Claude agents to build a <a href="https://www.anthropic.com/engineering/building-c-compiler">C compiler</a> that could compile the Linux kernel. Anthropic and Andon Labs designed a free-form experiment where Claude was tasked with <a href="https://www.anthropic.com/research/project-vend-2">maintaining a small shop in their office</a>. While benchmarks consist of dozens of tasks evaluated in an automated way, open-world evaluations consist of small samples, often require human intervention, and are evaluated in an open-ended way, such as by analyzing agent logs.</p><p>It is easy to dismiss these as unscientific: each such evaluation has a sample size of 1, and they lack standardization and reproducibility. Despite these limitations, we think such evaluations are important for collecting evidence about AI capabilities. They can provide early warnings about emerging capabilities to inform efforts at building societal resilience, help evaluators identify blind spots in existing benchmarks, and give companies a clearer picture of what tasks AI systems could soon carry out, informing strategic decisions about AI. We call them <strong>open-world evaluations</strong>.</p><p>In this essay, we conceptualize open-world evaluations, review past examples to identify best practices and pitfalls in conducting them, and introduce CRUX, a project aimed at regularly conducting new open-world evaluations. Here are our main insights:</p><ul><li><p><strong>Open-world evaluations are an important emerging class of AI evaluation.</strong> As AI systems become more capable, evaluations to elicit frontier capabilities must also increase in complexity. Open-world evaluations are the latest in a long line of evaluations of increasing complexity. 
We survey 10 prominent open-world evals conducted over the last year to identify best practices and key takeaways.</p></li><li><p><strong>CRUX (Collaborative Research for Updating AI eXpectations)</strong> is our attempt at systematically conducting open-world evaluations. The team consists of collaborators from government, academia, and non-profits, many of whom have led open-world evaluations, and who have a range of expectations about the future of AI. We aim to provide empirical evidence about the <em>present capabilities</em> of AI systems, even if eliciting them is currently costly, and to provide early warnings for capabilities that might soon be widespread. We plan to release new open-world evaluations regularly.</p></li><li><p><strong>In our first CRUX experiment, we tasked an AI agent with developing and publishing a simple iOS app to the App Store. </strong>Many benchmarks test agents&#8217; ability to write code. But publishing an iOS app involves many other steps: signing the app, publishing a privacy policy on a webpage, filling out Apple&#8217;s forms, and taking the app through the review process. We were more interested in whether the agent could satisfy the real-world requirements of publishing the app than in its ability to write code, so we tasked it with building a simple app and taking it through the iOS App Store submission process.</p></li><li><p><strong>The agent was <a href="https://apps.apple.com/us/app/breathe-easy-calm-breaths/id6760207382">successful</a> after making two errors, one of which required manual intervention </strong>(forgetting where the correct credentials were stored and fabricating a fictional phone number for the App Store review process). The process of developing and publishing the app cost about $1,000. The app is now live on the <a href="https://apps.apple.com/us/app/breathe-easy-calm-breaths/id6760207382">iOS App Store</a>. We think the cost could have been far lower: the app development and submission cost only about $25; the vast majority of tokens were spent monitoring the app&#8217;s review status. We reached out to Apple a month before the publication of this essay to disclose the results of our experiment. App store operators should prepare for and police spam submissions, as they might soon see thousands of apps submitted autonomously by agents.</p></li><li><p><strong>How can we improve open-world evaluations? What&#8217;s next?</strong> To increase the usefulness of open-world evaluations, evaluators should specify what and how much human intervention is allowed, release logs collected while the agent was solving the task, and analyze logs to report what an agent did in the course of solving the task. In future CRUXes, we will evaluate AI R&amp;D automation, AI governance, and many other areas.</p></li></ul><h2><strong>Open-world evaluations are an important emerging class of AI evaluation</strong></h2><p>In this section, we define open-world evaluations and survey the emerging landscape of such evaluations to extract insights about their success and limitations. We discuss areas where open-world evals can overcome some of the blind spots of benchmarks. In our view, as AI systems become more capable, evaluations to elicit frontier capabilities must become more complex; open-world evaluations are the latest in a series of evaluations of increasing complexity.
We also discuss the limits of open-world evaluations compared to benchmarks.</p><p>We discuss five loose criteria that define open-world evaluations below (see &#8220;What are open-world evaluations?&#8221;). But it is worth noting that the line between long, complex benchmark tasks and open-world evaluations is blurry. Indeed, many of the evaluations we discuss are sandboxed. We still include them in our list of open-world evals, because they satisfy our other criteria (e.g., Carlini&#8217;s C compiler is sandboxed, but involves just one long-running task, human intervention, and qualitative analysis as part of the evaluation).</p><h4>Benchmarks can both overestimate and underestimate progress</h4><p>People with drastically different views on AI <a href="https://asteriskmag.substack.com/p/common-ground-between-ai-2027-and">agree</a> that current AI benchmarks might soon be <a href="https://red.anthropic.com/2026/mythos-preview/">saturated</a>. Many prominent benchmarks have been saturated in the last two years, and evaluators have raced to release &#8220;successor&#8221; benchmarks. Many of these updated benchmarks are themselves near saturation.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!FLk-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a36d1f-f2be-4f0a-ac30-fca88df74521_2048x996.jpeg" width="1456" height="708" alt="A chart of popular AI benchmarks and their successor benchmarks"><figcaption class="image-caption"><em>Many popular benchmarks (such as <a href="https://arxiv.org/abs/2310.06770">SWE-Bench</a>, <a href="https://arxiv.org/abs/1911.01547">ARC-AGI</a>, <a href="https://arxiv.org/abs/2406.12045">&#964;-bench</a>, <a href="https://www.tbench.ai/">Terminal Bench</a>, and <a href="https://metr.org/time-horizons/">METR&#8217;s Time Horizon task suite</a>) <a href="https://arxiv.org/abs/2506.07982">have</a> <a href="https://arcprize.org/blog/arc-agi-2-technical-report">seen</a> <a href="https://arxiv.org/abs/2601.11868">successor</a> <a href="https://openai.com/index/introducing-swe-bench-verified/">benchmarks</a> <a href="https://metr.org/blog/2026-1-29-time-horizon-1-1/">being</a> <a href="https://www.swebench.com/multilingual-leaderboard.html">released</a> <a href="https://sierra.ai/resources/research/tau-3-bench">in</a> <a href="https://arcprize.org/arc-agi/3">the</a> <a href="https://arxiv.org/abs/2410.03859">last</a> two years.</em></figcaption></figure></div><p>Does this mean AI systems will soon be capable of solving any task? Not necessarily.
Benchmark improvements could indicate real capability gains, but they can also overstate progress: for example, because benchmarks have limited construct validity&#8212;they may test accuracy on narrow tasks rather than general ability&#8212;and they don&#8217;t test how well agents handle the messiness of real-world environments.</p><p>On the flip side, benchmarks could also <em>underestimate</em> progress due to challenges in the environment or infrastructure that are only incidental to the capabilities being measured, such as encountering CAPTCHA.</p><p>One way to get a fuller picture is to use metrics other than accuracy. For example, in a recent preprint several of us coauthored, we showed that even though agents are improving drastically in terms of capability metrics (such as average accuracy), they have improved much more slowly on metrics that measure <a href="https://arxiv.org/abs/2602.16666">reliability</a>. Similarly, many agent solutions to SWE-bench tasks that were judged correct because they pass tests would nevertheless be <a href="https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/">rejected</a> by project maintainers.</p><p>But this still doesn&#8217;t allow us to measure the upper bounds of AI capabilities. Capabilities that are possible today (even if only under favorable conditions) might soon be widespread, and we need to anticipate them before they are. Providing such early warnings gives businesses extra time to capitalize on new opportunities, institutions to build resilience, and policymakers to address risks.</p><p>Now let&#8217;s discuss in more detail why benchmarks can both overestimate and underestimate capabilities. They can overestimate capabilities because:</p><p><strong>Benchmarks resemble tasks amenable to modern reinforcement learning (RL) techniques.</strong> To turn a task into a benchmark, it needs to be precisely specified and automatically verifiable. But whatever is precise enough to benchmark is also precise enough to optimize for, and modern RL training runs increasingly resemble the shape of the benchmarks themselves. This makes it easy to saturate any task that can be measured using benchmarks.</p><p>This is already the case for leading evaluation platforms like <a href="https://github.com/harbor-framework/harbor">Harbor</a>, which double as RL training platforms. As a result, AI models could be directly trained on data from many prominent benchmarks included on these platforms. Even if benchmarks include held-out test sets, models might be trained on tasks from the training set that look very similar to those they would encounter in the test set. So benchmarks don&#8217;t help us understand how well performance generalizes to the real world.</p><p><strong>Benchmarks avoid real-world messiness. </strong>Real-world tasks involve underspecified interactions that can&#8217;t be fully sandboxed, such as responding to unexpected situations or navigating environments that are open-ended. Benchmarks can take some steps to simulate messy environments, but they can&#8217;t fully replicate them.</p><p>Benchmarks can also underestimate capabilities, because:</p><p><strong>Eliciting frontier capabilities is costly. </strong>Running large-scale, long-running experiments is expensive, making it impractical to achieve the sample sizes that benchmarks rely on. Anthropic&#8217;s C compiler cost ~$20k; the task we describe below (developing and publishing an iOS app for CRUX #1) cost ~$1,000.</p>
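<p>A quick back-of-the-envelope calculation (ours, with illustrative numbers) shows the scaling problem: at the C compiler&#8217;s roughly $20k per attempt, even a modest suite of 100 similarly ambitious tasks with 5 runs each would cost on the order of 100 &#215; 5 &#215; $20k = $10 million per evaluation round.</p>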
<p>We can&#8217;t run such experiments hundreds of times, which limits the budget and complexity feasible for each benchmark task.</p><p><strong>Average performance is very different from upper-bound capability elicitation. </strong>Running dozens (or hundreds) of tasks in a benchmark suite is necessary only for measuring <em>average</em> performance. But when we are trying to understand the frontier of what agents can do, the goal shifts to understanding <em>best-case</em> performance: what can agents accomplish when given sufficient resources and support to work around incidental failures? This is necessary for providing early warnings for capabilities that might soon be widespread.</p><p><strong>Human intervention can help elicit capability upper bounds.</strong> Agents working on real-world tasks could hit policy refusals, need to solve CAPTCHAs, or get stuck on other infrastructure issues. This could negatively affect their performance, but these failures are only <em>incidental</em> to the capability being measured. If human operators handle them, we can elicit upper bounds of capability. Such manual intervention is impractical for hundreds of benchmark tasks each time they are run.</p><p>As AI capabilities improve, sandboxed evaluations that test AI capabilities in coding, deep research, and customer service require intensely engineered environments to challenge agents and avoid contamination or reward hacking. For example, the performance of agents on web benchmarks has been <a href="https://arxiv.org/pdf/2510.02418">affected</a> by how often agents encounter CAPTCHAs, rather than reflecting the true underlying capabilities of the agents. Recent work has highlighted AI agents <a href="https://www.nist.gov/caisi/cheating-ai-agent-evaluations">finding answers online</a>, <a href="https://github.com/SWE-bench/SWE-bench/issues/465">exploiting bugs</a> in evaluations, and <a href="https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/">producing code</a> that passes tests but fails to meet standards for production. This highlights the need for deeper qualitative evaluation of AI agent performance, which can surface failure modes and problem-solving strategies that are lost in the barrage of average benchmark scores.</p><p>But these threats to validity are not straightforward to address. We can conduct <a href="https://www.aisi.gov.uk/blog/a-pipeline-for-transcript-analysis-using-inspect-scout">qualitative log analysis</a> for benchmarks to address some of these concerns, but even when we find validity issues in benchmark results using log analysis, there&#8217;s little we can do about it except release an updated benchmark, which could take a few months. Open-world evals allow us to conduct dry runs to fix issues before the real test run, and to manually intervene to fix issues found during the evaluation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Of course, benchmarks can be helpful despite these limitations.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> And open-world evaluations have their own limitations, which we discuss later in this paper. That said, we hope this list illustrates the systematic blind spots in benchmarking.</p>
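<p>To make the distinction between average accuracy, reliability, and best-case elicitation concrete, here is a minimal sketch. The results matrix and the 60% per-run success rate are hypothetical, purely for illustration:</p><pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
# Hypothetical results: 200 benchmark tasks x 5 repeated runs, where each
# run independently succeeds 60% of the time (illustrative numbers only).
runs = rng.random((200, 5)) &lt; 0.6

mean_accuracy = runs.mean()          # what leaderboards typically report
best_of_k = runs.any(axis=1).mean()  # upper bound: solved in at least one run
all_of_k = runs.all(axis=1).mean()   # reliability: solved in every run

print(f"average: {mean_accuracy:.2f}")  # ~0.60
print(f"best-of-5: {best_of_k:.2f}")    # ~0.99
print(f"all-of-5: {all_of_k:.2f}")      # ~0.08
</code></pre><p>The same per-run behavior yields very different numbers depending on which question we ask. Open-world evaluations, with humans patching incidental failures, target something closer to the best-of-k row.</p>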
<p>As AI agents become more capable, the gaps we identify between traditional benchmarking and real-world capability will continue to grow. Success metrics will need to be multi-faceted to capture the diversity of objectives in a given task. Benchmarks will need to incorporate key bottlenecks (human assistance) and messy environments (internet navigation), each of which introduces additional issues of internal and external validity. Open-world evaluations offer an alternative.</p><h4>What are open-world evaluations?</h4><p>As evaluation methods have matured alongside AI capabilities, a gradient in evaluation has emerged. At one end are simple, automated, scalable methods that work well for early-stage capabilities. At the other end are richer, more labor-intensive methods that become necessary as capabilities improve and saturate simpler metrics.</p><p>As capabilities in a domain improve, evaluations further along the gradient become important to get insights that are complementary to simpler evaluations. We think of this gradient as having roughly five levels so far, each with its strengths and limitations:</p><ol><li><p><strong>Q&amp;A benchmarks</strong> (e.g., <a href="https://arxiv.org/abs/2009.03300">MMLU</a>, <a href="https://arxiv.org/abs/2311.12022">GPQA</a>): useful for broad knowledge assessment, but increasingly saturated for frontier models. They are often formatted as multiple-choice questions for ease of grading, but as a result they have low construct validity, since users rarely interact with models by asking multiple-choice questions.</p></li><li><p><strong>Open-ended chat benchmarks</strong> (e.g., <a href="https://arxiv.org/abs/2406.04770">WildBench</a>, <a href="https://github.com/lmarena/arena-hard-auto">Arena-hard-auto</a>): capture more nuance, but still limited to single-turn or short interactions.</p></li><li><p><strong>Outcome-only agent benchmarks </strong>(e.g., <a href="https://arxiv.org/abs/2310.06770">SWE-Bench</a>, <a href="https://arxiv.org/abs/2307.13854">WebArena</a>): test agent performance on real tasks, but only measure whether the task was completed, not how. As a result, they have limitations; for example, <a href="https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/">most passing SWE-Bench solutions are not accepted by repository maintainers</a>.</p></li><li><p><strong>Agent benchmarks with log analysis</strong> (e.g., <a href="https://www.aisi.gov.uk/blog/a-pipeline-for-transcript-analysis-using-inspect-scout">UK AISI transcript analysis</a>, <a href="https://metr.org/time-horizons/">METR Time Horizon</a>): go deeper by examining <em>how</em> agents succeed or fail, uncovering errors and reward hacking by analyzing agent logs. But they still operate in sandboxed environments with predefined tasks.</p></li><li><p><strong>Open-world evaluations</strong>: long-horizon tasks in real-world environments, where success can&#8217;t be neatly specified or automatically graded. This allows eliciting upper bounds of capabilities, but it comes at the cost of reproducibility and standardization (which are both benefits of benchmarking; see the section on Limitations).</p></li></ol><p>Some evaluations blur the categories of this spectrum. For example, OpenAI&#8217;s <a href="https://openai.com/index/gdpval/">GDPVal</a>, a long-horizon agent benchmark, is designed to be manually graded based on experts&#8217; opinions of quality.
This structure is very similar to an open-world evaluation, though the grading primarily focuses on outputs rather than on analyzing the logs. At the same time, results on GDPVal are commonly reported using <a href="https://artificialanalysis.ai/evaluations/gdpval-aa">GDPval-AA</a>, which uses automated LLM grading; this evaluation setting resembles an outcome-only agent benchmark.</p><p>Open-world evaluations consist of running agents on a small number of long-horizon tasks in real-world settings, and qualitatively evaluating their results <a href="https://www.aisi.gov.uk/blog/a-pipeline-for-transcript-analysis-using-inspect-scout">using tools for log analysis</a>. These evaluations are complementary to benchmarks and help address many of their limitations. They can also uncover tasks that remain out of reach for current AI systems, informing future benchmarking efforts.</p><p>We provide a rough taxonomy to clarify the boundary of what constitutes open-world evaluations. No single dimension determines whether an experiment qualifies as &#8220;open-world&#8221;. Instead, it depends on the overall pattern across all of the dimensions below (the toy sketch that follows these definitions makes this concrete).</p><ul><li><p><strong>Openness</strong>. Is this evaluation in a real-world deployment setting (as opposed to a sandboxed environment)?</p></li><li><p><strong>Complexity/length</strong>. Would the task take a human days or weeks to complete (as opposed to a few minutes or hours)?</p></li><li><p><strong>Number of tasks</strong>. Is this a stand-alone task or a small set of tasks (as opposed to a large evaluation suite or benchmark)?</p></li><li><p><strong>Human intervention</strong>. To help elicit upper bounds of capabilities, are humans able to intervene when agents hit a hurdle (as opposed to just setting up the environment or resolving setup issues)?</p></li><li><p><strong>Method of evaluation</strong>. Does the evaluation primarily consist of in-depth log evaluation (as opposed to the result being driven by a single average metric)?</p></li></ul><p>When do we call something an evaluation as opposed to simply using agents to accomplish something novel? For example, Anthropic used AI agents to <a href="https://www.anthropic.com/news/mozilla-firefox-security">find security vulnerabilities</a> in leading open-source software such as Mozilla Firefox&#8212;is this an example of an open-world evaluation? We still consider these open-world evals if the role of the agent is systematically and publicly documented (including which parts were carried out by the agent and which by human experts, and the end result).</p><p>The line between complex benchmark tasks and open-world evaluations is also blurry.</p>
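<p>As a toy illustration of what &#8220;overall pattern across dimensions&#8221; means, here is a sketch in Python. The field names and the threshold are our own shorthand for the criteria above, not a formal definition:</p><pre><code class="language-python">from dataclasses import dataclass

@dataclass
class EvalProfile:
    """Our five loose criteria for open-world evaluations."""
    real_world: bool          # openness: real deployment rather than a sandbox
    long_horizon: bool        # complexity/length: days or weeks of human effort
    few_tasks: bool           # a stand-alone task or small set, not a large suite
    human_intervention: bool  # humans may unblock the agent mid-task
    log_analysis: bool        # graded by in-depth log review, not a single metric

def looks_open_world(profile: EvalProfile) -&gt; bool:
    # No single dimension is decisive; what matters is the overall pattern.
    score = sum([profile.real_world, profile.long_horizon, profile.few_tasks,
                 profile.human_intervention, profile.log_analysis])
    return score &gt;= 4  # a toy threshold, not a formal rule

# Carlini's C compiler: sandboxed, but open-world on the other four dimensions.
c_compiler = EvalProfile(real_world=False, long_horizon=True, few_tasks=True,
                         human_intervention=True, log_analysis=True)
print(looks_open_world(c_compiler))  # True
</code></pre><p>No such rubric settles the boundary; borderline cases remain a judgment call.</p>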
<p>For example, some evaluations in our list of open-world evals below are sandboxed (e.g., <a href="https://futurism.com/advanced-ai-stuck-pokemon">Claude Plays Pokemon</a> and <a href="https://www.anthropic.com/engineering/building-c-compiler">Anthropic&#8217;s C compiler</a> were both run in sandboxed environments, but we still consider them open-world evaluations because each consists of a single, complex, long-running task, allows human intervention during the evaluation, and is evaluated qualitatively).</p><p>The two types of evals are also complementary: we could imagine using open-world evaluations to understand what tasks remain unsolvable by agents as a first step towards building new benchmarks, and also using open-world evals to evaluate agents on messy real-world tasks that are not amenable to being benchmarked.</p><p>Finally, we don&#8217;t mean to imply that open-world evaluations are categorically more informative for understanding AI progress compared to benchmarks. Indeed, we discuss many limitations of open-world evaluations later in this section. Rather, as AI capabilities increase, open-world evaluations become important as an <em>additional complementary</em> signal of AI capabilities.</p><h4>An incomplete survey of open-world evaluations</h4><p>Over the past year, researchers at AI labs, universities, non-profits, and independent groups have begun running open-world evaluations. These share a common structure: give a capable AI agent a hard, real-world task with a long time horizon and observe and analyze its behavior in detail. Some notable examples:</p><ol><li><p><strong><a href="https://futurism.com/advanced-ai-stuck-pokemon">Anthropic, Claude Plays Pokemon</a></strong> (Feb 2025). Anthropic launched a Twitch livestream in which Claude 3.7 Sonnet played Pokemon Red. While not a real-world deployment, the experiment was an early example of setting an AI agent up in a relatively open environment compared to typical benchmarks. While the project illustrated progress in AI computer use with minimal scaffolding, it also made the limitations of early 2025 agents clear: Sonnet 3.7 remained stuck in a level for <a href="https://futurism.com/advanced-ai-stuck-pokemon">nearly 80 hours</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p></li><li><p><strong><a href="https://theaidigest.org/village">AI Digest, AI Village</a></strong> (April 2025-present). The AI Village gives multiple AI agents their own computer environments and a shared group chat, then tasks them with open-ended real-world goals like fundraising for charity, organizing real in-person events, making word games, and gaining subscribers on Substack. Across different experiments in 2025, the project <a href="https://theaidigest.org/village/blog/what-we-learned-2025">highlighted</a> persistent failure modes of hallucination, miscalibration, and unproductive loops, but also documented notable improvements from late 2025 agents on these dimensions.</p></li><li><p><strong><a href="https://www.anthropic.com/research/project-vend-1">Anthropic/Andon Labs, Project Vend</a></strong> (June 2025-present). Anthropic partnered with Andon Labs to have a scaffold using Claude 3.7 Sonnet (nicknamed &#8220;Claudius&#8221;) operate a small automated store in their office.
The agent managed inventory, set prices, and interacted with customers over several weeks, revealing a rich set of failure modes around manipulation, prioritization, and real-world decision-making. In this first phase, the agent lost nearly all revenue through poor planning, hallucination, and excessive discounts. A<a href="https://www.anthropic.com/research/project-vend-2"> second phase</a> expanded the experiment to multiple locations, used newer models, and included a red-teaming exercise by the staff at the Wall Street Journal. This follow-up was more successful, posting positive profit each week, but the WSJ staff were still able to <a href="https://futurism.com/future-society/anthropic-ai-vending-machine">jailbreak</a> &#8220;Claudius,&#8221; leading it to give all products away for free. Andon Labs recently began a <a href="https://andonlabs.com/blog/andon-market-launch">third phase of the project </a>where they gave a Claude-based agent, &#8220;Luna,&#8221; a three-year lease on a brick-and-mortar store in San Francisco and tasked it with making a profit as the &#8220;manager,&#8221; including hiring human employees, designing the brand, and making product selections.</p></li><li><p><strong><a href="https://cursor.com/blog/scaling-agents">Lin, Cursor browser experiment</a></strong> (Jan 2026). Wilson Lin at Cursor coordinated hundreds of GPT-5.2 agents to build a web browser from scratch, running uninterrupted for one week. The resulting browser (&#8220;FastRender&#8221;) consisted of over a million lines of Rust with a from-scratch rendering engine. It could render simple websites but was far from production-ready. The project was notable for its exploration of hierarchical multi-agent coordination at scale and the specific failure modes that emerge when agents work on a project for days rather than minutes.</p></li><li><p><strong><a href="https://www.anthropic.com/engineering/building-c-compiler">Carlini, C compiler</a></strong> (Feb 2026). Nicholas Carlini at Anthropic tasked Claude with building a C compiler from scratch, spending roughly $20k in API costs. The agent produced a working compiler that could compile the Linux kernel and pass a large fraction of standard test suites, revealing detailed information about where agents excel (systematic code generation, test-driven iteration) and where they struggle (complex optimization passes, debugging subtle spec violations).</p></li><li><p><strong><a href="https://epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job">Ho, &#8220;How Close is AI to Taking my Job&#8221;</a></strong> (Feb 2026). Epoch researcher Anson Ho had Claude Code and ChatGPT Atlas attempt to autonomously complete three challenging work tasks at Epoch: replicating an interactive web interface for a 40-parameter economic model, writing an article in Epoch&#8217;s style on AI progress in 2025, and porting an article from Google Docs to Substack and Epoch&#8217;s website. The project highlighted formatting and hallucination as persistent bottlenecks even in otherwise successful knowledge-work tasks.</p></li><li><p><strong><a href="https://developers.openai.com/blog/run-long-horizon-tasks-with-codex">Choi, GPT 5.3 Codex Builds a Design Tool</a> </strong>(Feb 2026)<strong>. </strong>OpenAI&#8217;s Derrick Choi had GPT-5.3 Codex run autonomously for 25 hours, generating 35,000 lines of code, to build a &#8220;design tool&#8221; from scratch.
The post noted that the agent showed impressive planning, memory, and verification processes, though it did not provide substantive analysis of the capabilities and limitations of the finished product.</p></li><li><p><strong><a href="https://blog.cloudflare.com/vinext/">Faulkner, Next.js Reimplementation</a> </strong>(Feb 2026)<strong>. </strong>An engineer at Cloudflare used Claude with OpenCode to release vinext, a reimplementation of the popular frontend web framework<a href="http://next.js"> Next.js</a> on Vite rather than React. While the initial blogpost advertised coverage of 94% of<a href="http://next.js"> Next.js</a> for only $1,100 in API costs, follow-up <a href="https://news.ycombinator.com/item?id=47142156">analyses</a> highlighted persistent security limitations and questioned how well the result generalizes to other software domains, given how much the project relied on the human-built tests and infrastructure of both Vite and Next.js.</p></li><li><p><strong><a href="https://x.com/DimitrisPapail/status/2028669695344148946">Papailiopoulos, &#8220;Can You Train a Computer&#8221;</a></strong> (March 2026). Dimitris Papailiopoulos and collaborators tested whether Claude Code and OpenAI Codex could train a transformer to function as a general-purpose computer. The experiment included both a fully autonomous round, where both agents failed and reward-hacked their solutions, and a human-guided version, where Claude Code succeeded and displayed meaningful generalization, including solving multi-step computations never seen in training.</p></li><li><p><strong><a href="https://x.com/karpathy/status/2031135152349524125">Karpathy, Nanochat Autoresearch</a></strong> (March 2026). Using nanochat, an existing open-source project for GPT-2-level LLM training, Andrej Karpathy built a simple automation pipeline for AI agents to optimize training in 5-minute increments. The agent had complete autonomy to adjust the architecture, hyperparameters, optimizers, and batch sizes. In a follow-up, Karpathy <a href="https://x.com/karpathy/status/2031135152349524125">shared</a> that autoresearch made progress against &#8220;Time to GPT-2&#8221; (measured with 8xH100 GPUs), dropping this metric by 11% in 2 days.</p></li></ol><p>We plan to collect a running list of such evaluations (and the key takeaways and limitations of each) on <a href="http://cruxevals.com">cruxevals.com</a>. Improving AI capabilities have allowed people across domains to conduct such evaluations and test whether AI systems can carry out tasks in their own areas of expertise, and interest in these evaluations is growing. In just the past week, there have been many significant open-world evaluation releases, including <a href="https://epoch.ai/blog/mirrorcode-preliminary-results">MirrorCode</a> by Adamczewski et al., which tasked agents with reimplementing large programs, a set of <a href="https://alignment.anthropic.com/2026/automated-w2s-researcher/">automated alignment research</a> case studies by Wen et al., and an exercise using Claude Code to train models to <a href="https://github.com/dphuang2/tinker-cookbook/tree/claude/golf-forecasting-setup-VIpRZ/tinker_cookbook/recipes/golf_forecasting">forecast the outcome of the recent Masters golf tournament</a> by Huang.</p><h4>Limitations of open-world evaluations</h4><p>While open-world evaluations are helpful in addressing some blind spots in benchmarking, they also suffer from many limitations.
There remains a genuine tradeoff between developing better benchmarks and investing in open-world evaluations. Benchmarks give evaluators control over the task and allow evaluations to occur in controlled environments, but are less useful for understanding agents&#8217; performance on open-ended tasks. Open-world evaluations sacrifice sandboxing and evaluator control to improve construct validity for the task and elicit upper-bound capabilities. Concretely, open-world evals have the following limitations:</p><p><strong>Lack of reproducibility and standardization:</strong> Benchmarks have been successful because they provide coordinating functions for the AI community. Researchers can develop and test new methods independently, and validate whether these methods work well on benchmarks to get the community&#8217;s attention. The culture of benchmarking runs so deep that David Donoho called it the &#8220;<a href="https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734">secret sauce</a>&#8221; for the success of the AI/ML community over the last 50 years. Open-world evaluations give up the reproducibility and standardization that made benchmarking so successful.</p><p><strong>Hard to compare agents: </strong>Benchmarks offer a relative comparison between different models/agents. But since open-world evaluations are often run one or a handful of times, the run-to-run variability might be higher than the difference between different agents&#8217; performance. As a result, open-world evaluations are not useful for comparing the accuracy of different models or agents.</p><p><strong>The need for domain expertise: </strong>Evaluating whether an agent succeeded in an open-world evaluation can be challenging. Verification of the agent&#8217;s work might require deep domain expertise and time, especially if the task is open-ended.</p><p><strong>Incomplete recall of log analysis: </strong>Even if automated log analysis is used to analyze open-world evaluations, it can never be considered complete. Agent transcripts from long-horizon tasks can run to hundreds of millions of tokens, making thorough human review impractical. And because agent behavior in these tasks is complex, there is no guarantee that a given round of analysis will surface all noteworthy behaviors or errors. This limitation does not apply in the same way to benchmarks, where success criteria are predefined and automatically verified. Releasing logs publicly so that a broader community can examine them is one way to partially mitigate this concern.</p><p><strong>Blurry success criteria: </strong>Given the potential for human intervention, it is hard to cleanly delineate the agent&#8217;s performance from the help given to it by the human.</p><p><strong>Non-stationary environments: </strong>In open-world evals, agents can interact with open-ended environments such as the internet.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> This makes it hard to make generalizable claims about the agent&#8217;s capabilities as opposed to the information it can access via the internet (or another open-ended environment). For example, could an agent solve a hard software engineering task because it is genuinely competent at such tasks (and will therefore be able to solve novel software engineering tasks), or because it was able to look up a specific task online (and won&#8217;t be able to solve a novel task)?
It is also challenging to compare agent performance over time. For example, the internet contains a growing list of implementations or hints that an unsandboxed agent can pull from.</p><h4>How different stakeholders can use open-world evaluations</h4><p>Open-world evaluations have limitations, but they can be helpful in addressing the blind spots of benchmarks. We think they can benefit a number of stakeholders:</p><p><strong>Policymakers: </strong>The diffusion of AI across domains <a href="https://www.normaltech.ai/p/ai-as-normal-technology">lags behind improvements in capabilities</a>. This allows institutions to adapt as AI systems become more capable. Open-world evaluations can further increase this lead time by providing early warnings of what agents could soon be able to do autonomously and at scale, giving institutions time to build resilience. For example, <a href="https://www.anthropic.com/news/claude-code-security">Anthropic</a>&#8217;s <a href="https://www.anthropic.com/glasswing">recent work</a> on discovering cybersecurity vulnerabilities using AI could spur efforts to rapidly adopt AI for defensive cybersecurity, especially in critical software infrastructure.</p><p><strong>AI evaluators and researchers:</strong> Open-world evaluations offer a complementary signal to benchmarks by testing capabilities that are structurally resistant to benchmarking, such as solving messy real-world tasks. Analyzing agent logs allows us to uncover instances of the agent taking shortcuts and reward hacking. On the flip side, log analysis can find areas where agents develop new insights or surpass previous limitations. For example, in our iOS app development evaluation, we identified that the agent modified its approach to be more token-efficient, which allowed it to drastically reduce the cost of solving the task, without any input or instruction from our end.</p><p><strong>Frontier AI developers: </strong>AI developers should actively support and participate in external open-world evaluation efforts by providing access (such as pre-release access to models) as well as <a href="https://arxiv.org/abs/2403.04893">safe harbors</a> for third parties conducting evaluations that might not adhere to developers&#8217; terms of service (such as for evaluating safety). Open-world evaluations by independent third parties could surface findings that internal red teams miss if they optimize for known threat models.</p><p>New models are also saturating benchmarks. For example, Anthropic&#8217;s <a href="https://www-cdn.anthropic.com/8b8380204f74670be75e81c820ca8dda846ab289.pdf">Mythos Preview system card</a> notes that the model &#8220;saturates many of our most concrete, objectively-scored evaluations,&#8221; leaving them with noisier methods for assessing capabilities. Open-world evaluations could offer a complementary way to stress-test models in realistic settings that benchmarks can no longer distinguish.</p><p>To foster an ecosystem of open-world evaluations, it is important to develop shared best practices and build toward a cumulative body of evidence about what agents can and can&#8217;t do. That is what we are trying to do with a new project called CRUX.</p><h2><strong>Introducing CRUX: Collaborative Research for Updating AI eXpectations</strong></h2><p>CRUX is a project for operationalizing open-world evaluations. We aim to conduct them on a regular basis.
Each evaluation will involve a long-horizon, real-world task; the implementation of an agent scaffold that could <em>in theory</em> allow agents to solve the task; and detailed analysis of what the agent did in order to solve the task. Our team consists of researchers spanning industry, academia, and non-profits; many of them have conducted open-world evaluations.</p><p>In addition to conducting evaluations, we also want to develop mechanisms to provide early warnings for AI capabilities that will soon be widespread. For example, if AI agents can <em>almost</em> autonomously develop and publish apps (our task for the first iteration of CRUX, discussed in the next section), app store operators such as Apple and Google might soon need to update their policies to manage spam submissions.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>This requires eliciting upper bounds of capabilities&#8212;often filling in missing incidental capabilities (such as solving CAPTCHAs) with human input. Open-world evaluations are a good fit for this particular style of evaluation, since they allow us to deeply understand AI systems&#8217; capabilities.</p><p>Another advantage of open-world evaluations is that we don&#8217;t need to develop a large task suite or design complex environments or sandboxes before conducting evaluations, since each task only needs to be carried out and evaluated a small number of times. This allows us to conduct new evaluations regularly. We plan to design a new CRUX evaluation, analyze results, and publish our analysis every 1-2 months.</p><p>Our first evaluation tests whether AI agents can autonomously develop and publish apps to the iOS App Store; we discuss this in the next section. In future iterations, we plan to expand to a wide range of domains, including tasks on AI R&amp;D automation, AI governance, complex software engineering, and real-world physical tasks.</p><h4>CRUX #1: Can AI agents autonomously develop and publish an iOS app?</h4><p>The question of whether AI agents can write software has been extensively studied, both through benchmarks like SWE-Bench and <a href="https://www.tbench.ai/">Terminal Bench</a>, and through open-world evaluations like the C compiler and browser experiments discussed earlier in this essay. Agents have shown strong coding capabilities (though questions of code quality and reliability remain <a href="https://x.com/GergelyOrosz/status/2036925672501715425?s=20">unresolved</a>).</p><p>A task that has not been evaluated as closely is whether agents can handle the non-coding aspects of software deployment, such as satisfying platform requirements and interacting with review systems they do not control. For our evaluation, we tasked an agent with building a mobile app from scratch and publishing it on the iOS App Store.</p><p>We prompted the agent to develop and publish a <em>simple</em> app to the App Store. We weren&#8217;t primarily interested in the agent&#8217;s software engineering ability, but rather in its ability to navigate Apple&#8217;s App Store submission process. This process requires developers to configure signing certificates and provisioning profiles, prepare screenshots and metadata, draft and host a privacy policy at a public URL, fill out compliance questionnaires, and submit the app for review by Apple&#8217;s team.
Reviewers may reject the app for technical or policy reasons, requiring the developer to diagnose the issue, make changes, and resubmit. This process typically takes several days and involves interacting with systems and reviewers that the developer does not control.</p><p>The agent was responsible for every step of the process except those where human involvement is required by policy, such as setting up the Apple Developer account and hitting publish to release the app to the App Store. Specifically, the agent handled writing the code, building the app, preparing metadata, drafting and hosting a privacy policy, submitting it for review, and handling any feedback. (We provided the agent access to a Mac VM, a GitHub account, an Apple developer account, and a Gmail account.)</p><p>The success criterion was whether the agent got the app published on the App Store. We also logged how many manual interventions the agent needed from us; the fewer, the better. The agent had the option to ping the team for support, and we monitored its progress once a day.</p><p>In addition, if agents can do this autonomously (or are close to being able to do so), this serves as an early warning for Apple&#8217;s review processes, since agents might soon be able to publish thousands of apps autonomously. The App Store has already seen an <a href="https://appfigures.com/resources/insights/20251205?f=2">increase</a> in the number of published apps, but if agents could develop and publish apps fully autonomously, the number of submissions could increase dramatically.</p><h4>Our setup for the agent</h4><p>We used <a href="https://openclaw.ai/">OpenClaw</a> as the agent scaffold with Claude Opus 4.6 and <a href="https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking#adaptive-thinking-with-the-effort-parameter">adaptive thinking enabled</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> We chose OpenClaw for this experiment since it is configurable, integrates well with the browser, and natively supports long-running tasks. Given its recent popularity, we also wanted to evaluate its capabilities as a scaffold, and to potentially compare it to other scaffolds in the future. (Note that we used it as a fairly general scaffold for the task; in fact, we did not make any changes to the default OpenClaw setup beyond prompting it and giving it deeper access to the macOS VM.)</p><p>We also wanted to test whether visual reasoning and GUI operation would be bottlenecks to solving the task. Based on reports of users&#8217; experiences, OpenClaw seemed to handle these challenges well, and we wanted to test this ourselves.</p><p>We recognize that OpenClaw has security risks in real-world use. We still chose to use it since we wanted to evaluate the capability frontier, while recognizing that security might be a bottleneck for real-world adoption.</p><p>The agent had access to a macOS virtual machine with expansive permissions (sudo, screen visibility, UI control) to perform this task. We logged all of the agent&#8217;s actions and reasoning chains, and stored the screenshots the agent took while solving the task.</p><h4>Addressing evaluation awareness</h4><p>AI models often recognize they are being evaluated and adjust their behavior. This is known as evaluation awareness.
Recent work from <a href="https://www.anthropic.com/engineering/eval-awareness-browsecomp">Anthropic</a>, <a href="https://www.apolloresearch.ai/blog/claude-sonnet-37-often-knows-when-its-in-alignment-evaluations/">Apollo</a>, and <a href="https://alignment.openai.com/prod-evals/">OpenAI</a> has shown that frontier models increasingly detect evaluation contexts on their own and sometimes modify their behavior in response.</p><p>How should we address evaluation awareness? We think hiding the evaluation context from a frontier model is increasingly a lost cause; models are getting better at detecting test environments with each generation. Since we plan to conduct CRUX evaluations regularly and publish our plans publicly, future agents could simply come across descriptions of our evaluations during the course of a task.</p><p>More importantly, concerns about evaluation awareness are most acute when measuring alignment properties, where a model might strategically alter its behavior. For a capability evaluation, if the agent succeeds at the task, it has demonstrated the capability regardless of whether it knew it was being evaluated.</p><p>To address evaluation awareness, we explicitly informed the agent that it was being evaluated. This allowed us to communicate the parameters of the evaluation clearly; in particular, that it would be judged on the number of times it required unnecessary human inputs, but that some kinds of inputs (e.g., resolving CAPTCHAs) were completely fine to delegate. We were concerned that without this framing, the agent might avoid asking for help even when doing so was appropriate, which would have caused us to underelicit its performance.</p><h4>Conducting dry runs</h4><p>Before the full evaluation run, we conducted two dry runs to understand whether our agent setup worked well, and if not, what we needed to change. This allowed us to discover and fix bugs in the scaffold. Notably, these dry runs did not involve any interaction with Apple&#8217;s App Store submission or review processes.</p><p>Setting up the OpenClaw agent with all the permissions it needed to autonomously develop apps also required work on our end. We estimate it took us eight person-hours and about $50 in API cost to set up the agent scaffold; this included configuring the virtual machine to ensure the agent had full control to take any actions; setting up logging to monitor the agent&#8217;s work; and configuring an email account, GitHub account, and Apple developer account for the agent to use and access.</p><p>This setup effort might seem like a bottleneck to autonomous app development, but it only needs to be carried out once: a few person-hours and $50 are unlikely to deter spammers looking to submit thousands of apps to the App Store.</p><p>In our dry runs, the agent primarily used the command line and the browser. It used the command line to generate code, build the app, and prepare it for submission. It used the browser to log into App Store Connect, access certificates, and fill out forms. When command-line commands hung because of requests for permission, it was able to take screenshots and simulate mouse clicks, for example to click &#8220;Allow&#8221; to grant itself permissions.</p><h4>The final evaluation</h4><p>After two dry runs, we started the full evaluation. The agent took 45 minutes to develop a simple app for breathing exercises.
<h4>The final evaluation</h4><p>After two dry runs, we started the full evaluation. The agent took 45 minutes to develop a simple app for breathing exercises. This included developing the app, publishing a privacy policy using GitHub Pages, filling out the App Store review forms, and submitting the app for review.</p><p>We set up the agent to check the app&#8217;s status every 5 minutes once the app was sent for review; the review itself took 10 days. The app is <a href="https://apps.apple.com/app/id6760207382">now live on the App Store</a>. (To comply with Apple&#8217;s policies, the agent needed approval from our team before publishing the app.)</p><p>The agent required one unnecessary manual intervention<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>: it could not locate the credentials we had previously given it to access the Apple developer account.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> It also fabricated the phone number submitted for Apple&#8217;s review process, using a <a href="https://www.cbc.ca/radio/undertheinfluence/the-real-reason-hollywood-uses-555-phone-numbers-1.6070537">fictional number</a> instead of asking us for the correct one; the App Store review went through despite this error.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> This alerted us to the need for proactive monitoring of agent actions to catch such unintended behavior, which we plan to implement for future CRUX evaluations. The final app functions well, though it contains a toggle for sound that doesn&#8217;t work. The agent also produced a screenshot for the App Store listing with visible formatting errors.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dmiP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7434feaf-bac4-45ee-b854-64983d7054bb_1766x1178.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!dmiP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7434feaf-bac4-45ee-b854-64983d7054bb_1766x1178.jpeg" width="1456" height="971" alt=""></a><figcaption class="image-caption">The screenshots uploaded by the agent had visible formatting errors.</figcaption></figure></div><p>The agent was eventually <a href="https://apps.apple.com/us/app/breathe-easy-calm-breaths/id6760207382">successful</a> in publishing the app, at a total cost of about $1,000. Development and submission cost only about $25; the vast majority of the tokens were spent checking whether the app had cleared review. We think the total cost could have been dramatically lower had we optimized the scaffold for efficiency, such as by waking the agent less frequently to check the app&#8217;s status, but in this evaluation we erred on the side of a higher budget.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a></p>
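<p>A back-of-the-envelope calculation shows why polling frequency dominated the bill. The per-wake cost below is an assumed round number, not a measured figure; only the roughly 10-day review window comes from our run.</p><pre><code># Rough cost model: monitoring cost scales linearly with wake frequency.
# cost_per_wake is a hypothetical price for one short status-check turn.
REVIEW_HOURS = 10 * 24  # the app sat in review for about 10 days

def monitoring_cost(wake_interval_min: float, cost_per_wake: float) -> float:
    wakes = REVIEW_HOURS * 60 / wake_interval_min
    return wakes * cost_per_wake

print(monitoring_cost(5, 0.30))   # 5-minute polling   -> 864.0 dollars
print(monitoring_cost(30, 0.30))  # 30-minute default  -> 144.0 dollars
</code></pre><p>Even at OpenClaw&#8217;s default 30-minute interval, monitoring under this assumed price would dwarf the ~$25 of actual development, which is consistent with the strategy shift described in footnote 10: the agent ultimately made each wake cheaper rather than less frequent.</p>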
<p>In short, the agent couldn&#8217;t completely automate the task, but it was extremely close to being able to do so.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> As a result, we notified Apple&#8217;s product security team of our experiment four weeks before publishing the results, since we thought some version of responsible disclosure was warranted; spammers could soon submit thousands of apps to the iOS App Store using agents.</p><h4><strong>Lessons for open-world evaluations</strong></h4><p>Through running CRUX and studying the growing body of similar efforts, we have begun to identify what makes an open-world evaluation informative. We expect these lessons to evolve over time. But we think they are worth sharing now, because there is growing interest in these evaluations, and developing shared evaluation norms can help.</p><p><strong>Be specific about what you are measuring and what it implies.</strong> One reason why people had sharply diverging opinions about Anthropic&#8217;s C compiler project or Cursor&#8217;s browser was that these projects did not clearly specify the target of their measurement. The results were very impressive from the perspective of measuring whether agents can work productively for long periods on well-defined tasks. But they were perhaps less impressive from the perspective of developing software that is directly usable by and useful to end users; the GitHub issues for both projects highlight core technical complaints, such as <a href="https://github.com/anthropics/claudes-c-compiler/issues/1">failing to compile &#8220;hello world&#8221; out of the box</a>, or <a href="https://github.com/wilsonzlin/fastrender/issues/98">welcome page hangs with the standard build script</a>. Had the authors been clear that they were trying to measure the former, and not the latter, it could have clarified the public discussion of these projects. (This is not to criticize the authors of these pieces. They were among the first to conduct open-world evaluations, and it is hard to know a priori how people will react.)</p><p>For example, writing software has many <a href="https://personal.utdallas.edu/~chung/BOOK/book.html">non-functional requirements</a>, such as quality, reliability, and maintainability, that developers might sacrifice when shipping apps autonomously coded using AI agents. Many concerns about Cursor&#8217;s browser and Anthropic&#8217;s compiler were actually concerns about these hard-to-specify properties not being satisfied by applications developed using agents.</p><p><strong>Design the task so that human intervention is straightforward and well-documented. </strong>In real-world tasks, agents will sometimes need help, such as for navigating policy refusals, CAPTCHAs, or infrastructure failures. Traditional benchmarks cannot accommodate human-in-the-loop interventions, but in open-world evaluations such inputs help ensure we measure the upper bound of capabilities. This requires documenting precisely when, why, and how humans step in, so that the degree of autonomy can be assessed clearly.</p>
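<p>As a minimal sketch of what such documentation might look like: the schema below is illustrative rather than a standard, and the field names are ours.</p><pre><code># One record per human intervention; the schema is illustrative only.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Intervention:
    timestamp: datetime
    trigger: str        # e.g. "2FA dialog", "CAPTCHA", "daemon crash"
    requested_by: str   # "agent" if it asked for help, else "evaluator"
    action_taken: str   # what the human actually did
    necessary: bool     # required by policy/platform, or an agent failure?

log: list[Intervention] = []
log.append(Intervention(datetime.now(), "2FA dialog", "agent",
                        "approved sign-in on a trusted device", True))
# Only unnecessary interventions (necessary=False) count against autonomy.
</code></pre>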
<p><strong>Invest in log analysis. </strong>The logs from an agent attempting a complex task contain far more information than the binary outcome. Analyzing them can uncover insights about agent behavior that would be invisible from outcomes alone. For example, how did the agent choose to decompose the problem? Did it get stuck at any point, and, if so, how did it self-correct? Did it search through solution paths in a principled or haphazard way? Which aspects of the task proved most challenging? Did the agent misrepresent any aspect of its outputs or progress?</p><p><strong>Consider complementing log analysis with real-time monitoring. </strong>Post-hoc log analysis is valuable, but it is not sufficient on its own to catch all unintended agent actions. In previous open-world evaluations, agents operating with substantial autonomy have sometimes taken actions that were difficult for human reviewers to detect after the fact. For example, in many AI experiments conducted by the AI Village, agents took <a href="https://theaidigest.org/village/blog/what-we-learned-2025">unintended actions</a>, such as attempting to send hundreds of unsolicited emails. In our own evaluation, the agent fabricated a fictional phone number that went undetected until a later round of review. Automated real-time monitoring, for example a separate agent that continuously reviews the primary agent&#8217;s actions and flags anomalies or errors as they occur, could serve as a valuable complement to human review.</p>
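<p>Here is a minimal sketch of the shape such a monitor might take, assuming some <code>judge</code> callable that wraps a second model; the prompt and flag logic are illustrative, not a tested design.</p><pre><code># Sketch of a real-time monitor: a second model reviews each action the
# primary agent takes and flags anomalies before they compound.
from typing import Callable, Iterable

def monitor(actions: Iterable[str],
            judge: Callable[[str], str],
            alert: Callable[[str], None]) -> None:
    """Review primary-agent actions as they stream in; flag anomalies."""
    for action in actions:
        verdict = judge(
            "Does this agent action fabricate information, contact third "
            f"parties, or exceed the task scope? Action: {action!r}. "
            "Reply OK or FLAG.")
        if verdict.strip().upper().startswith("FLAG"):
            alert(action)  # e.g., pause the agent and page a human
</code></pre>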
<p><strong>Conduct dry runs before the open-world experiment. </strong>Testing the agent scaffold, the evaluation criteria, and the infrastructure before running the full evaluation helps uncover hidden assumptions about the task and errors in the scaffold. Our dry runs for the iOS app development task uncovered many issues with our scaffold before we began the real attempt.</p><p><strong>Measure cost. </strong>For many tasks, capabilities continue improving with increased budgets. Developers should treat cost measurement as a first-class goal when conducting open-world evals, and report their findings alongside the budget they used. Even when general claims about capability upper bounds are out of reach, measuring partial progress can show whether increasing the budget helps <a href="https://epoch.ai/blog/mirrorcode-preliminary-results">advance progress</a> towards task completion.</p><p><strong>Release logs. </strong>While open-world evaluations lack reproducibility, collecting and releasing logs to a broad community can help. For example, external researchers can extend analyses of how well agents performed or where they failed, and also verify the results.</p><p>We plan to conduct new CRUX evaluations regularly. In future CRUXes, we expect to conduct evaluations on a wide range of topics, including AI R&amp;D tasks, AI governance, video generation, and more challenging software engineering tasks. We plan to keep <a href="http://cruxevals.com">cruxevals.com</a> updated with our team's efforts and those of the broader community on open-world evaluations.</p><h4><strong>Author contributions and acknowledgments</strong></h4><p><strong>Core team:</strong> Sayash Kapoor and Arvind Narayanan conceptualized the project and designed the first evaluation task. Andrew Schwartz led the agent development and executed the evaluation. Peter Kirgis led the log analysis and literature review of open-world evaluations. Stephan Rabanser offered feedback and inputs into the task design, essay text, and our analysis and interpretation of the results. Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, and Arvind Narayanan drafted the essay.</p><p><strong>Collaborators</strong>: Rishi Bommasani offered feedback and inputs into the task design, essay text, and our analysis and interpretation of the results. J.J. Allaire, Magda Dubois, Gillian Hadfield, Andy Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, and Cozmin Ududec offered feedback and inputs into the essay text and our analysis and interpretation of the results.</p><p><strong>Acknowledgments: </strong>Nicholas Carlini provided feedback on the task design and agent setup. Ryan Greenblatt, Daniel Kokotajlo, and Ajeya Cotra participated in online and in-person conversations that informed the project.</p><p><strong>Funding. </strong>We are grateful to Coefficient Giving, Schmidt Sciences, and the Princeton AI Lab for funding to support this project.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Of course, we could also use dry runs to uncover issues with benchmarks. But benchmarks are intended to compare model capabilities over time. Many issues with benchmarks are only obvious when a more capable model finds a shortcut or an edge case that wasn&#8217;t obvious when the benchmark was constructed. Benchmark developers can analyze logs to uncover such cases. But resolving these issues requires updating the benchmark and re-running evaluations on prior models, which undermines the long-term validity of benchmark results. As AI agents become more capable, we expect such issues to arise more often.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>There have always been issues with evaluations; the requirement is not that they are perfect, but rather that they provide a useful proxy for progress in AI capabilities. And there remain evaluations that are unsaturated, such as <a href="https://arxiv.org/abs/2407.13168">SciCode</a>, <a href="https://arxiv.org/abs/2406.01574">MMLU-Pro</a>, <a href="https://arxiv.org/abs/2501.14249">Humanity&#8217;s Last Exam</a>, and <a href="https://arxiv.org/abs/2509.16941">SWE-Bench Pro</a>.
In fact, even saturated capability benchmarks can be useful for measuring the efficiency and reliability of AI agents, which are essential factors in AI diffusion.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For context, most children are able to beat the entire game <a href="https://howlongtobeat.com/game/7169">in around 25 hours.</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This concern also affects benchmarks that require access to the internet, for example web benchmarks such as <a href="https://arxiv.org/abs/2407.15711">AssistantBench</a> and <a href="https://arxiv.org/abs/2311.12983">GAIA</a>. But benchmarks run in sandboxed environments bypass this concern, at the cost of construct validity (an agent solving the task in the real world would be able to use resources on the internet).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>There is already some evidence that Apple&#8217;s App Store reviews have started <a href="https://9to5mac.com/2026/03/29/vibe-coding-developers-report-long-app-store-review-queues/">taking</a> <a href="https://x.com/nikitabier/status/2033931821260648659">longer</a> as a result of the adoption of coding agents. Still, the rate of apps published is <a href="https://appfigures.com/resources/insights/20251205?f=2">much lower</a> than at the peak of app submissions in 2016, though it&#8217;s possible that a bigger wave of slop submissions is already arriving without ultimately being published. But our results show that agents can autonomously submit apps that are eventually approved by Apple. Once agents can develop and publish apps autonomously, a much bigger <a href="https://www.fastcompany.com/91522242/apple-app-store-vibe-coding-generative-ai-unexpected-phenomenon">wave</a> of apps that satisfy publication criteria might be submitted to the store, and App Store policies and practices might need updates to accommodate this.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>We made minor modifications to the agent to set up a subagent that verifies the agent&#8217;s outputs and wakes it up every 5 minutes to check for updates (e.g., responses from Apple&#8217;s review team). OpenClaw&#8217;s default interval for waking the agent is 30 minutes.
In retrospect, waking the agent every 5 minutes increased the API cost for the task significantly.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The agent required five total interventions, but four of them concerned requirements from Apple (either technical or policy-based) or flaws in our evaluation setup. Apple <a href="https://developer.apple.com/library/archive/documentation/Security/Conceptual/Security_Overview/Architecture/Architecture.html">intentionally prevents</a> <a href="https://support.apple.com/guide/security/controlling-app-access-to-files-secddd1d86a6/web">synthetic interactions</a> (such as those from the agent) with sensitive dialogs like two-factor authentication approvals, which the agent couldn&#8217;t work around; it needed to prompt us to log in to the Apple developer account. At one point during our evaluation, the OpenClaw daemon crashed, requiring a manual restart. We also prompted the agent to ask for approval before publishing the app to satisfy the App Store&#8217;s policy requirements.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>When the agent was stuck and asked for help, a member of our team suggested reusing previously provided credentials and resolving 2FA. The agent briefly searched its memory for those credentials, but did not proceed with a live sign-in attempt. Instead, it recovered by discovering that the App Store Connect API key still existed at the expected hidden path, allowing monitoring to resume.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>While it is hard to say for sure, one reason the agent used a fictional phone number could have been that we informed the agent it was being evaluated. Also note that while we carried out in-depth log analysis on the agent&#8217;s actions, as we discussed earlier, log analysis can never be considered complete, so there might be interesting things we haven&#8217;t found yet. We have released our agent logs to aid third-party log analysis efforts (see <a href="http://cruxevals.com">cruxevals.com</a>).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Interestingly, partway through the evaluation, the agent changed its strategy to reduce the monitoring cost significantly: it started using subagents rather than the entire context, and began using shorter daily memory files. This reduced the running cost from $35/hour to $3/hour.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Note that our agent did not encounter any objections from Apple&#8217;s reviewers during the review process. On one hand, this shows that the agent was able to develop an app that passed the App Store&#8217;s bar for publication.
On the other, we were unable to test how well the agent would perform in communications with Apple reviewers.</p></div></div>]]></content:encoded></item><item><title><![CDATA[New Paper: Towards a science of AI agent reliability]]></title><description><![CDATA[Quantifying the capability-reliability gap]]></description><link>https://www.normaltech.ai/p/new-paper-towards-a-science-of-ai</link><guid isPermaLink="false">https://www.normaltech.ai/p/new-paper-towards-a-science-of-ai</guid><dc:creator><![CDATA[Sayash Kapoor]]></dc:creator><pubDate>Tue, 24 Feb 2026 13:07:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2e9d4b87-9068-4764-b50a-9c458b45252b_1486x960.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>By Stephan Rabanser, Sayash Kapoor, Arvind Narayanan</em></p><p>Suppose you hear about a new AI agent for improving productivity &#8212; by making purchases, or writing code, or sending emails, or handling a customer on your behalf. Should you trust it? Can the agent do the job reliably enough? After all, there are <a href="https://www.businessinsider.com/replit-ceo-apologizes-ai-coding-tool-delete-company-database-2025-7">many</a> <a href="https://www.washingtonpost.com/technology/2025/02/07/openai-operator-ai-agent-chatgpt/">horror</a> <a href="https://themarkup.org/artificial-intelligence/2024/03/29/nycs-ai-chatbot-tells-businesses-to-break-the-law">stories</a> of agents going wrong.</p><p>Surprisingly, even though the lack of reliability of AI agents is well known, the AI industry currently doesn&#8217;t have good tools for measuring reliability, or even a good definition of it.</p><p>Arvind and Sayash have long been <a href="https://sites.google.com/princeton.edu/agents-workshop">thinking</a> about this. Last fall, we were joined by postdoctoral researcher <a href="https://rabanser.dev/">Stephan Rabanser</a>, whose PhD looked at the reliability question in simpler, more traditional AI systems. We recruited a few other independent researchers, and have released what we hope is a comprehensive measurement of reliability. Our draft paper is called <a href="https://arxiv.org/abs/2602.16666">Towards a Science of AI Agent Reliability</a>.</p><p>We borrowed insights from many other fields, such as nuclear and aviation safety. We were able to decompose reliability into four high-level dimensions, measured through 12 distinct metrics. Evaluating 14 models on two complementary benchmarks, we found that nearly two years of rapid capability progress have produced only modest reliability gains.
See our interactive dashboard <a href="https://hal.cs.princeton.edu/reliability/">here</a>.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KXpN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd904192c-a01e-42fb-8990-b46f387f4690_2120x1276.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!KXpN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd904192c-a01e-42fb-8990-b46f387f4690_2120x1276.jpeg" width="1456" height="876" alt=""></a></figure></div><p>While our findings are tentative at this stage, we hope they can help explain the <a href="https://www.dwarkesh.com/p/thoughts-on-ai-progress-dec-2025">puzzlement</a> among many in the industry as to why the economic impacts of AI agents have been gradual, even though they are crushing capability benchmarks.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> To help the community track reliability systematically, we plan to launch an AI agent &#8220;reliability index&#8221;. We hope this will stimulate researchers and industry to invest effort into improving reliability.</p><h3>Table of Contents</h3><ol><li><p><a href="https://www.normaltech.ai/i/189010640/accuracy-isnt-enough-four-dimensions-of-reliability">Accuracy isn&#8217;t enough: four dimensions of reliability</a></p></li><li><p><a href="https://www.normaltech.ai/i/189010640/capability-gains-are-rapid-but-improvements-in-reliability-are-modest">Capability gains are rapid, but improvements in reliability are modest</a></p></li><li><p><a href="https://www.normaltech.ai/i/189010640/why-we-could-be-wrong">Why we could be wrong</a></p></li><li><p><a href="https://www.normaltech.ai/i/189010640/what-should-deployers-do-differently">What should deployers do differently?</a></p></li><li><p><a href="https://www.normaltech.ai/i/189010640/what-should-researchers-and-developers-do-differently">What should researchers and developers do differently?</a></p></li><li><p><a href="https://www.normaltech.ai/i/189010640/what-do-our-findings-mean-for-ai-progress">What do our findings mean for AI progress?</a></p></li><li><p><a href="https://www.normaltech.ai/i/189010640/further-reading">Further reading</a></p></li></ol><h3>Accuracy isn&#8217;t enough: four dimensions of reliability</h3><p>When we consider a coworker to be reliable, we don&#8217;t just mean that they get things right most of the time. We mean something richer:</p><ol><li><p>They get it right consistently, not right today and wrong tomorrow on the same thing (Consistency)</p></li><li><p>They don&#8217;t fall apart when conditions aren&#8217;t perfect (Robustness)</p></li><li><p>They tell you when they&#8217;re unsure rather than confidently guessing (Predictability)</p></li><li><p>When they do mess up, their mistakes are more likely to be fixable than catastrophic (Safety)</p></li></ol>
<p>Unfortunately, AI agents are evaluated based on a single number, the average success rate at the task. That number has been going up quickly on many tasks over the last two years, which is why there&#8217;s so much excitement about deploying agents.</p><p>Safety-critical engineering fields (aviation, nuclear, automotive) figured out decades ago that reliability is not the same as average performance. These fields independently converged on the above four dimensions: consistency, robustness, predictability, and safety (the frequency and severity of failures).</p><p>For example, nuclear reactor protection systems <a href="https://www.nrc.gov/docs/ML0705/ML070550085.pdf">must respond identically</a> every time conditions warrant shutdown. Automotive safety testing evaluates <a href="https://www.iso.org/standard/68383.html">responses to sensor failures</a> and adverse weather. Nuclear risk assessment models thousands of failure modes and <a href="https://www.nrc.gov/reading-rm/doc-collections/fact-sheets/probabilistic-risk-asses">quantifies their probabilities</a>. Aviation <a href="https://www.faa.gov/documentLibrary/media/Advisory_Circular/AC%2023.1309-1E.pdf">targets one catastrophic error per billion flight hours</a>.</p><h3>Capability gains are rapid, but improvements in reliability are modest</h3><p>We refined and decomposed these four high-level dimensions into twelve metrics. We then tested agents based on 14 models from OpenAI, Google, and Anthropic, spanning 18 months of releases. We looked at two complementary benchmarks: a general assistant benchmark (GAIA) and a customer service simulation benchmark (TauBench). We ran each task five times, with instructions paraphrased. We injected faults in the tools and environment to measure robustness to such failures, and elicited the agents&#8217; confidence that they had solved the task to measure calibration. In total, we executed 500 benchmark runs.</p><p>We found that reliability has improved only modestly over 18 months, while accuracy improved substantially. All three major providers cluster together, so this appears to be an industry-wide limitation (though there are some cases where Anthropic&#8217;s models outperform OpenAI&#8217;s and Google&#8217;s).</p><p>More specifically, we measured the following criteria:</p><ul><li><p><strong>Consistency</strong>: Agents that can solve a task often fail on repeated attempts under identical conditions. Many models have trouble giving a consistent answer, with outcome consistency scores ranging from 30% to 75%.</p></li><li><p><strong>Robustness</strong>: Most models handle genuine technical failures (server crashes, API timeouts) gracefully. But if we rephrase the instructions while preserving their meaning, performance drops substantially.</p></li><li><p><strong>Predictability</strong>: Agents are not good at knowing when they&#8217;re wrong. This is the weakest dimension across the board. When agents report confidence, it often carries little signal.
On one benchmark, most models couldn&#8217;t distinguish their correct predictions from incorrect ones better than chance.</p></li><li><p><strong>Safety</strong>: Recent models are noticeably better at avoiding constraint violations, though financial errors, such as incorrect charges, remain a common failure mode. We use safety narrowly to mean bounded harm when failures occur, not broader concerns like alignment. We are still iterating on how we measure safety, so we report it separately from the aggregate reliability score.</p></li><li><p><strong>Impact of scaling</strong>: Bigger models aren&#8217;t uniformly more reliable. Scaling up improves some aspects (calibration, robustness) but can hurt consistency. Larger models with richer behavioral repertoires sometimes show more run-to-run variability.</p></li></ul><h3>Why we could be wrong</h3><p>Our view is that reliability lags capability, and that it will remain a barrier to deployment unless researchers and developers treat it as a dimension to improve separately from accuracy. There are three reasons why we could be wrong.</p><p>First, there is some subjectivity in our dimensions and metrics. We have tried to minimize this by grounding our analysis in existing engineering fields. And we finalized our list of metrics <em>before</em> performing experiments, in order to prevent our hypothesis about slow reliability progress from influencing our selection of metrics. Still, we welcome other researchers to suggest alternative ways to measure reliability, and emphasize that <strong>our findings are tentative at this stage.</strong></p><p>Second, maybe reliability won&#8217;t matter if accuracy gets high enough. Our metrics are carefully crafted so that accuracy gains don&#8217;t automatically lead to reliability gains. Broadly speaking, accuracy is about the rate of failures while reliability is about the nature of failures. But if an agent is accurate 99% of the time, maybe we can tolerate 1% error even if it is completely unpredictable. We disagree. Our view is that for autonomous operation in high-stakes contexts, we need 3-5 &#8220;nines&#8221; of performance &#8212; 99.9% to 99.999% accuracy &#8212; in order for reliability to become a non-issue, and we don&#8217;t think LLM-based agents are on track to reach such a threshold. But time will tell.</p><p>Third, reliability progress being slower than accuracy doesn&#8217;t necessarily mean that it is slow in absolute terms. If we project the current linear trend forward, agents will reach 100% reliability in just three years! We don&#8217;t think a linear model makes sense, in part because we expect each order of magnitude decrease in <em>unreliability</em> (1-reliability) to be as hard as the previous one. That is, we expect the jump from 90 to 99% reliability to be about as hard as the jump from 99 to 99.9% reliability, and so on. But again, we just have to wait and see.</p>
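<p>A toy comparison of the two projection models makes the difference vivid. The starting reliability and rates below are invented for illustration; they are not fitted to our measurements.</p><pre><code># Naive linear extrapolation vs. "constant effort per order of magnitude
# of unreliability". All numbers are made up for illustration.
R0 = 0.90  # assumed starting reliability
for year in range(4):
    linear = min(1.0, R0 + (0.1 / 3) * year)      # reaches ~100% by year 3
    per_nine = 1 - (1 - R0) * 10 ** (-year / 3)   # one extra "nine" per 3 years
    print(f"year {year}: linear {linear:.4f}  per-nine {per_nine:.4f}")
# Under the per-nine model, year 3 reaches only 99%, and each further
# nine (99.9%, 99.99%, ...) takes as much effort as the last.
</code></pre>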
<p>Suppose we&#8217;re right. There are important implications for deployers, researchers &amp; developers, and for those tracking the pace of AI progress. Let&#8217;s discuss each in turn.</p><h3>What should deployers do differently?</h3><p><strong>Clearly distinguish automation from augmentation</strong>. A coding assistant that occasionally suggests wrong variable names is annoying; an autonomous agent that manages an industrial plant with highly variable results is unacceptable. The difference is whether the agent is used to augment a human&#8217;s creativity or directly make decisions. Augmentation tools (copilots, search assistants) get a reliability &#8220;discount&#8221; because someone reviews the output. In some augmentation use-cases, aspects of reliability such as consistency might actually be undesirable. For example, a creative writing assistant that produces the same story every time would be terrible at its job.</p><p>Incidentally, how well agents collaborate with humans is woefully under-theorized and under-measured. There is some early work on <a href="https://arxiv.org/abs/2405.10632v2">evaluating</a> <a href="https://arxiv.org/abs/2405.10632v2">human-LM</a> interaction, but these efforts predate autonomous agents, and we are not aware of any equivalent work studying how agents collaborate with humans over multi-step tasks. Uplift studies offer one useful lens, but a broader agent-focused effort is overdue.</p><p><strong>Consider reliability for making release decisions</strong>. For automation tools (unattended workflows, customer-facing bots), reliability is non-negotiable. Deployers should consider requiring reliability thresholds before moving from sandbox to production, the way aviation requires certification before service. There are many other practices to borrow from such domains, such as building an <a href="https://arxiv.org/abs/2508.14231">incident</a>-<a href="https://cset.georgetown.edu/publication/ai-incidents-key-components-for-a-mandatory-reporting-regime/">reporting</a> <a href="https://arxiv.org/abs/2503.16861">culture</a> around agent failures.</p><p>While the metrics we have identified are broadly useful as a starting point for understanding reliability, deployers should build their own internal evaluations tailored to their specific context and datasets.</p><h3>What should researchers and developers do differently?</h3><p>Benchmarks drive progress in AI. A year and a half ago, our paper <a href="https://arxiv.org/pdf/2407.01502">AI Agents that Matter</a> showed there was a big gap between what agent benchmarks measured and what matters in practice. Our new paper shows that the gap persists. To fix this disconnect, both AI evaluation and AI development practices need to change course.</p><p><strong>Measure reliability</strong>. The current approach of running a benchmark once and reporting the accuracy number is a superficial performance measure. It is comparable to stress-testing a car once in perfect weather and declaring it safe if it passes. When evaluating agents, we need to test them across multiple runs (testing for variance in outcomes), under different conditions (evaluating adaptability), and on an ongoing basis (re-testing as models and environments change). We call for reporting reliability profiles alongside accuracy, not instead of it.</p><p><strong>Understand and improve reliability</strong>. Our experiments suggest that consistency and predictability are the biggest gaps preventing models from being more reliable. Agent developers should consider improving these weak points explicitly, possibly via targeted optimization or improved scaffolding. In particular, agents should be able to recognize when they are likely to fail and say so, and recover gracefully when they do fail.
More speculative ideas include agents that explore different strategies during development but follow a consistent execution plan once deployed, rather than solving the same task differently each time; this would deliver the best of both <a href="https://huggingface.co/blog/VirtualOasis/agents-vs-workflows-en">agents and workflows</a>.</p><h3>What do our findings mean for AI progress?</h3><p>The capability-reliability gap could be one reason why we are not seeing the rapid labor-market effects that artificial general intelligence has been predicted to bring about. It is not the only one. A recent UK AISI <a href="https://www.aisi.gov.uk/blog/mapping-the-limitations-of-current-ai-systems">report</a> identified six barriers to AGI. However, the report discussed each at a high level.</p><p>Our paper can be seen as putting flesh on the bones of one of these barriers &#8212; reliability. None of the four dimensions we identify can be considered solved at this point in time, and only two of the individual metrics, shown in green in the figure, can be considered (tentatively) solved.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MID-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f64a3e5-d811-480d-9003-b2fcd0bc6e92_1522x974.png"><img src="https://substackcdn.com/image/fetch/$s_!MID-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f64a3e5-d811-480d-9003-b2fcd0bc6e92_1522x974.png" width="1456" height="932" alt=""></a><figcaption class="image-caption"><em><strong>Future work on other barriers to AGI may reveal many other dimensions of performance that must be improved before AI agents can be widely deployed.</strong></em></figcaption></figure></div><p>There is much work to be done in fleshing out the other barriers to AGI, analogous to our analysis of reliability. Our hunch is that this will reveal many other dimensions and metrics on which progress has been slow.
The gazillion-dollar question is whether agents will get better across the board through general methods such as inference scaling and reinforcement learning, or whether painstaking work will be required to improve individual dimensions of reliability, adaptability, originality, and so on.</p><h3>Further reading</h3><ul><li><p>The full 66-page paper is available on <a href="https://arxiv.org/abs/2602.16666">arXiv</a> and includes definitions of the metrics we propose, details on our experimental setup, as well as an extended discussion of our recommendations and proposed future work. The authors are Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan. Alongside the paper, we also provide an <a href="https://hal.cs.princeton.edu/reliability/">interactive dashboard</a> of our results to enable practitioners to drill down into individual models, benchmarks, and reliability dimensions. The code to reproduce our experiments is available on <a href="https://github.com/steverab/hal-harness">GitHub</a>.</p></li><li><p>This paper is part of our broader effort to advance the <a href="https://sage.cs.princeton.edu/">Science of AI Agent Evaluation</a>. Key publications include <a href="https://www.normaltech.ai/p/new-paper-ai-agents-that-matter">AI Agents that Matter</a> and <a href="https://arxiv.org/pdf/2510.11977">Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation</a>, recently accepted to ICLR 2026.</p></li><li><p>Our paper can be seen as building on work that aims to extend the scope of what we measure through benchmarks. For example,&nbsp;<a href="https://arxiv.org/abs/2211.09110">HELM</a>&nbsp;introduced 7 metrics, including calibration and robustness, and used these to evaluate dozens of language models on prominent benchmarks.</p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Of course, this is not the only factor. Benchmarks continue to <a href="https://arxiv.org/pdf/2602.12413">struggle</a> with data contamination, so they might overestimate capability progress. <a href="https://knightcolumbia.org/content/ai-as-normal-technology">AI as Normal Technology</a> provides a useful framework to understand the barriers between capability progress and economic impacts. Reliability falls into the products/applications stage. Human, social, and organizational factors lie downstream.
As a case study of how serious these can be, see our recent essay <a href="https://www.lawfaremedia.org/article/ai-won-t-automatically-make-legal-services-cheaper">AI Won&#8217;t Automatically Make Legal Services Cheaper</a>.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[AI Won’t Automatically Make Legal Services Cheaper]]></title><description><![CDATA[Applying the AI as Normal Technology framework to legal services]]></description><link>https://www.normaltech.ai/p/ai-wont-automatically-make-legal</link><guid isPermaLink="false">https://www.normaltech.ai/p/ai-wont-automatically-make-legal</guid><dc:creator><![CDATA[Justin Curl]]></dc:creator><pubDate>Thu, 12 Feb 2026 19:26:05 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2a2691bd-fda5-4a31-aa6c-ff0d59c79e4d_2428x1368.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This essay was published in Lawfare&#8217;s Research Paper Series. The official version that should be cited is linked <a href="https://www.lawfaremedia.org/article/ai-won-t-automatically-make-legal-services-cheaper">here</a>.</em></p><p><em>The essay is co-authored with <a href="https://x.com/curl_justin">Justin Curl</a>, a third-year at Harvard Law School. Previously, he was a Schwarzman Scholar at Tsinghua University and earned a degree in Computer Science from Princeton, where we first collaborated. You can find more of his writing on <a href="https://substack.com/@justincurl">AI and the law here</a>.</em></p><div><hr></div><p>Many AI leaders believe the technology will transform knowledge work. OpenAI CEO Sam Altman predicts AI systems that are &#8220;smarter than humans by 2030,&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> while Anthropic CEO Dario Amodei analogizes future AI models to a &#8220;country of geniuses in a data center.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>Researchers identify legal services as especially vulnerable to disruption by AI.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> And since GPT-4 passed the bar exam,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> much of the profession seems to agree. Law schools have begun incorporating AI into their curricula<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a><sup> </sup>and partnering with AI-focused legal-tech companies to prepare future lawyers for a changing profession.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a><sup> </sup>One prominent lawyer has argued AI can already replace law clerks and oral argument.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> Another predicts AI could &#8220;replace traditional lawyers by 2035.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>This excitement about AI comes at a time when legal services are expensive. 
Millions of individuals are priced out of legal assistance,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> while corporate legal fees are increasing steadily, with hourly rates for partners at large law firms now exceeding $2,300.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> Unsurprisingly, many observers see the potential for AI to make legal services more accessible by delivering outcomes at lower costs.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a></p><p>Our central claim is that advanced AI will not, by default, help consumers achieve their desired legal outcomes at lower costs. We examine the bottlenecks<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> that stand between AI capability advances and the positive transformation of the practice of law that some envision. For AI to usher in a world of abundant legal services, the profession must address three bottlenecks: regulatory barriers, adversarial dynamics, and human involvement.</p><p>First, unauthorized practice of law (UPL) regulations may limit AI use by consumers (and to some extent lawyers). These laws prohibit nonlawyers from performing legal work.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> Individuals and organizations can face steep fines and criminal liability if courts conclude their systems cross into practicing law,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> forcing would-be providers to either limit their AI tools&#8217; functionality in legal domains or risk enforcement actions. Entity-based regulations&#8212;which restrict who can own equity in businesses that provide legal services&#8212;constrain how legal services are offered, again limiting how AI is used by lawyers and consumers. Without reforms, if consumers cannot access AI capabilities or lawyers are not incentivized to use AI well, AI will not help people accomplish their legal goals, regardless of how advanced it becomes.</p><p>Second, even if AI is effectively and widely adopted, the American<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> legal system&#8217;s adversarial structure can prevent advanced AI from lowering the cost of achieving clients&#8217; outcomes.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> Because legal outcomes often depend on relative rather than absolute quality, when both parties become more productive, the competitive equilibrium simply shifts upward.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a> In a world with advanced AI, achieving the same result&#8212;like settling favorably or prevailing at trial&#8212;would require a greater quantity and quality of legal work.
So even as productivity increases and cost per legal task falls, parties are locked into an arms race in which ever more legal work is required to reach the same outcome.</p><p>As a historical analogy, digitization could have reduced discovery costs by making document review much easier.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a> But litigators operating within litigation&#8217;s adversarial framework exploited the surge in digital documents to drive up costs for their opponents, leaving total litigation costs high.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a></p><p>Though less explicitly adversarial, transactional work (like contract negotiation) can exhibit similar dynamics: Lawyers compete to control disclosures and outmaneuver opposing counsel when drafting and negotiating agreements.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a> Of course, some legal outcomes (like effective estate planning) do not depend on adversarial processes, and this bottleneck would not apply to them.</p><p>The third and final bottleneck we discuss is human involvement in legal work. In a world where AI gains outpace increases in the volume of legal work, our desire for human beings to adjudicate cases and understand the contracts they sign becomes the binding constraint. In litigation, if AI enables a flood of legal work, judges will likely respond by taking longer to resolve disputes (delaying outcomes) or by delegating more to assistants (lowering adjudication quality).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a> And with transactional work, even if AI is drafting entire contracts, human lawyers will still need time to understand what the provisions mean for an organization&#8217;s interests. The speed of human decision-makers (whether judges, lawyers, or clients) places an upper limit on how much AI can accelerate legal processes without sacrificing human involvement.
See <strong>Figure 1</strong> below for a visual description of the three bottlenecks and our argument.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!BKN-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb460f8d7-d0f3-4752-a809-2b73c50b0138_1056x864.png" width="1056" height="864" alt="Diagram of the three bottlenecks (regulatory barriers, adversarial dynamics, and human involvement) between AI capability advances and better legal outcomes"><figcaption class="image-caption"><strong>Figure 1:</strong> The bottlenecks between advanced AI capabilities and the positive transformation of the practice of law</figcaption></figure></div><p>This report applies the &#8220;AI as Normal Technology&#8221; framework to a specific domain: the legal industry.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-22" href="#footnote-22" target="_self">22</a> This framework is fundamentally about agency: Rather than treating AI&#8217;s trajectory as predetermined by capability advances, it directs attention to the social and organizational bottlenecks between what AI can do and the impact it has on the world. In our analysis, AI&#8217;s diffusion through the practice of law will likely be slow. Better models have not yet translated into more reliable legal products because adapting workflows to leverage AI and teaching users take time.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-23" href="#footnote-23" target="_self">23</a></p><p>We argue that the end state of AI diffusion can look very different depending on the institutional response. Some pathways lead to genuine improvements in access and efficiency, while others simply make producing legal work (outputs) cheaper without making it easier to achieve the results clients want (outcomes). For AI to deliver better <em>outcomes</em> for consumers, the legal industry must enact reforms addressing the bottlenecks. Otherwise, we risk a future in which legal work becomes more abundant, but legal outcomes remain expensive and inaccessible.</p><p>We proceed in three sections. The first section explains why legal services are so expensive. The second section aims to convince readers that AI won&#8217;t automatically deliver legal outcomes at lower costs.
And the third section offers recommendations for addressing these bottlenecks based on existing proposals for legal reform and illustrates how drastically AI&#8217;s impact could differ depending on the legal profession&#8217;s response.</p><h1><strong>Table of Contents</strong></h1><ol><li><p><a href="https://justincurl.substack.com/p/16e98caa-1854-4db5-9a75-266e9a1c3a5b#why-legal-services-are-so-expensive">Why Legal Services Are So Expensive</a></p></li><li><p><a href="https://justincurl.substack.com/p/16e98caa-1854-4db5-9a75-266e9a1c3a5b#why-ai-wont-help-by-default">Why AI Won&#8217;t Help by Default</a></p><ol><li><p><a href="https://justincurl.substack.com/p/16e98caa-1854-4db5-9a75-266e9a1c3a5b#regulatory-barriers">Regulatory Barriers</a></p></li><li><p><a href="https://justincurl.substack.com/p/16e98caa-1854-4db5-9a75-266e9a1c3a5b#adversarial-dynamics">Adversarial Dynamics</a></p></li><li><p><a href="https://justincurl.substack.com/p/16e98caa-1854-4db5-9a75-266e9a1c3a5b#human-oversight">Human Oversight</a></p></li></ol></li><li><p><a href="https://justincurl.substack.com/p/16e98caa-1854-4db5-9a75-266e9a1c3a5b#institutional-reforms">Institutional Reforms</a></p><ol><li><p><a href="https://justincurl.substack.com/p/16e98caa-1854-4db5-9a75-266e9a1c3a5b#reforming-professional-regulation">Reforming Professional Regulation</a></p></li><li><p><a href="https://justincurl.substack.com/p/16e98caa-1854-4db5-9a75-266e9a1c3a5b#reforming-adjudication">Reforming Adjudication</a></p></li><li><p><a href="https://justincurl.substack.com/p/16e98caa-1854-4db5-9a75-266e9a1c3a5b#the-evolving-role-of-lawyers">The Evolving Role of Lawyers</a></p></li></ol></li><li><p><a href="https://justincurl.substack.com/p/16e98caa-1854-4db5-9a75-266e9a1c3a5b#conclusion">Conclusion</a></p></li></ol><h1><strong>Why Legal Services Are So Expensive</strong></h1><p>Three structural factors help explain why legal services are so expensive:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-24" href="#footnote-24" target="_self">24</a> Evaluating their quality is difficult, their value is often relative, and professional regulations limit competition from alternative business models.</p><p>First, unlike a meal at a restaurant, where it&#8217;s easy to assess quality, legal services are &#8220;credence goods,&#8221; which means their quality is difficult to evaluate even with hindsight.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-25" href="#footnote-25" target="_self">25</a> The final outcome in a case reflects the cumulative effect of many smaller decisions, so it can be very hard, even for other lawyers, to evaluate whether legal services were provided effectively.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-26" href="#footnote-26" target="_self">26</a> How clear was the law on that issue? Did the client reach the desired outcome because of or in spite of the lawyer&#8217;s skill? Which decisions actually contributed to that success?
This evaluation difficulty forces consumers to rely on proxies for quality (e.g., the prestige of a lawyer&#8217;s law school or judicial clerkships) when choosing between legal service providers, making it hard for traditional market mechanisms to work.</p><p>Second, the value of legal services is relative.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-27" href="#footnote-27" target="_self">27</a> Because &#8220;the American litigation system is a thoroughgoing adversarial&#8221; one, what matters to the lawsuit&#8217;s outcome often isn&#8217;t how good your lawyers are in absolute terms, but whether they&#8217;re better than the other side&#8217;s.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-28" href="#footnote-28" target="_self">28</a> Other kinds of legal work, such as drafting contracts and agreements (often called transactional work), can also be adversarial as lawyers try to &#8220;outfox&#8221; each other in terms of what is disclosed in negotiations and the language of a contract itself.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-29" href="#footnote-29" target="_self">29</a></p><p>These dynamics have kick-started an arms race for legal talent, driving up costs at the top end of the market, which serves corporate clients and is often called &#8220;BigLaw.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-30" href="#footnote-30" target="_self">30</a> In 2024, the median partner at large law firms charged $1,050 per hour, with some commanding over $2,300.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-31" href="#footnote-31" target="_self">31</a> That&#8217;s up 5.1 percent from 2023, which was itself up 5.4 percent from 2022.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-32" href="#footnote-32" target="_self">32</a> Fortune 200 companies reported that their average litigation costs in cases exceeding $250,000 in legal fees had nearly doubled over eight years, climbing from $66 million per company in 2000 to $115 million in 2008.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-33" href="#footnote-33" target="_self">33</a> In the patent field, a 2017 survey found that patent cases worth <em>less</em> than $1 million typically cost $1 million to litigate ($500,000 per side).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-34" href="#footnote-34" target="_self">34</a></p><p>Third, the profession&#8217;s regulatory framework, designed with consumer protection in mind, has created its own complications. Two types of regulations are often the focus of reform: unauthorized practice of law (UPL) and law firm ownership regulations.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-35" href="#footnote-35" target="_self">35</a></p><p>UPL laws make it illegal (in some jurisdictions a felony) for those without a law license to apply legal knowledge to specific circumstances.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-36" href="#footnote-36" target="_self">36</a> An unfortunate effect is to make it more expensive to offer basic legal assistance in contexts requiring little legal expertise.</p><p>Most states have regulations limiting who may share in legal fees.
Gillian Hadfield argues that these rules promote a business model that creates inefficiency for small firms.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-37" href="#footnote-37" target="_self">37</a> These firms serve individuals and small businesses and are sometimes called the &#8220;PeopleLaw&#8221; sector.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-38" href="#footnote-38" target="_self">38</a> She cites a 2017 Clio study of forty thousand customers: In an average eight-hour workday, lawyers engaged in billable work for only 2.3 hours, billed 1.9 hours, and collected payment for just 1.6 hours.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-39" href="#footnote-39" target="_self">39</a> So although clients paid an average of $260 per hour, lawyers effectively received $25&#8211;40 per hour because the rest of their time was spent finding clients, managing administrative tasks, and collecting payments. These regulations require lawyers to serve clients through partnerships fully owned and financed by lawyers. They deter alternative models that involve large-scale businesses with centralized billing, customer service, marketing, and administrative functions, which could leverage economies of scale to deliver legal services at $30&#8211;50 per hour instead of $260.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-40" href="#footnote-40" target="_self">40</a></p>
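<p>To make the utilization math concrete, here is a rough back-of-envelope sketch in Python. It is our illustration, not part of the Clio study: the study supplies the utilization figures, while the overhead adjustment in the final comment is our assumption about how a $260 hourly rate shrinks to the cited $25&#8211;40 effective figure.</p><pre><code class="language-python">
# Back-of-envelope effective-rate math from the 2017 Clio figures cited above.

HOURLY_RATE = 260      # average rate clients paid per billed hour ($)
WORKDAY_HOURS = 8      # hours in the average workday
BILLABLE_HOURS = 2.3   # hours spent on billable work
BILLED_HOURS = 1.9     # hours actually billed to clients
COLLECTED_HOURS = 1.6  # hours for which payment was collected

utilization = COLLECTED_HOURS / WORKDAY_HOURS   # 0.20: only a fifth of the day is paid
gross_per_day = HOURLY_RATE * COLLECTED_HOURS   # $416 collected per day
gross_rate = gross_per_day / WORKDAY_HOURS      # $52/hour spread across the whole day

print(f"Utilization: {utilization:.0%}")              # Utilization: 20%
print(f"Gross effective rate: ${gross_rate:.0f}/hr")  # Gross effective rate: $52/hr

# Assumption (ours, not from the study): netting out overhead such as rent,
# staff, software, and client acquisition from the $52/hour gross is what
# plausibly yields the cited $25-40/hour effective range.
</code></pre>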
<p>Importantly, none of the sources of market dysfunction are intrinsic to legal services. They reflect choices about procedure, pricing, and professional governance. While reform may be politically difficult or costly, the outlook is dim without it. And contrary to what some might hope, AI will not automatically make legal services cheaper, as we discuss next.</p><h1><strong>Why AI Won&#8217;t Help by Default</strong></h1><h2><strong>Regulatory Barriers</strong></h2><p>More legal assistance would be valuable in the debt collection context. From 1993 to 2013, the number of debt collection lawsuits grew from 1.7 million to about 4 million.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-41" href="#footnote-41" target="_self">41</a> In Michigan, these lawsuits made up 37 percent of all civil district court case filings by 2019.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-42" href="#footnote-42" target="_self">42</a> The trend is similar in Texas: &#8220;Debt claims more than doubled from 2014 to 2018, accounting for 30% of the state&#8217;s civil caseload by the end of that five-year period.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-43" href="#footnote-43" target="_self">43</a> More than 70 percent of debt collection defendants lose by default for failing to respond, even though many cases are &#8220;meritless suits&#8221; and responding is not complicated.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-44" href="#footnote-44" target="_self">44</a></p><p>New York has created a form for responding to debt collection lawsuits by checking some boxes.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-45" href="#footnote-45" target="_self">45</a> This form, however, includes questions difficult for nonlawyers to understand, such as whether someone would like to invoke the doctrine of &#8220;laches.&#8221; Recognizing this difficulty, the nonprofit Upsolve began training volunteers to offer assistance. Concerned that this might violate New York&#8217;s UPL rules, Upsolve sought an injunction on the ground that this basic assistance was protected by the First Amendment. A federal judge agreed.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-46" href="#footnote-46" target="_self">46</a> But New York appealed to the U.S. Court of Appeals for the Second Circuit, which vacated the injunction, concluding that the lower court applied the wrong First Amendment test (so Upsolve was no longer protected).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-47" href="#footnote-47" target="_self">47</a></p><p>It&#8217;s easy to see how an AI system could help here. A nonprofit could provide access to a tool customized for debt collection suits. Or individuals could directly ask general-purpose tools like ChatGPT, Claude, or Gemini for relevant information.
Despite this potential, organizations risk violating UPL laws whenever their AI tools complete &#8220;tasks that require legal judgment or expertise.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-48" href="#footnote-48" target="_self">48</a> The New York Bar Association, concerned that the shortcomings of current AI models would harm consumers, has warned that &#8220;AI-powered chat bots now hover on the line of unauthorized practice of law.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-49" href="#footnote-49" target="_self">49</a> While some legal researchers disagree because AI is not a &#8220;person&#8221; capable of exercising legal judgment<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-50" href="#footnote-50" target="_self">50</a> or because AI systems simply &#8220;provide information to users, similar to paper guides about court procedure,&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-51" href="#footnote-51" target="_self">51</a> all authors cited in this paragraph agree that the status of AI tools under UPL laws is currently unclear.</p><p>LegalZoom&#8217;s history of lawsuits illustrates how UPL regulations can deter innovation in the delivery of legal services.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-52" href="#footnote-52" target="_self">52</a> LegalZoom automates rote tasks like preparing documents for trademark filings and has been plagued by UPL lawsuits for years.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-53" href="#footnote-53" target="_self">53</a> In 2011, private individuals who had purchased the company&#8217;s services sued in Missouri, alleging LegalZoom was engaged in UPL because it claimed to &#8220;take[] over once a consumer answer[ed] a few simple online questions.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-54" href="#footnote-54" target="_self">54</a> After the court denied its motion to dismiss the case, LegalZoom agreed to compensate plaintiffs and modify its business model. In 2015, the North Carolina State Bar reached a consent judgment with LegalZoom requiring it to conform to certain conditions.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-55" href="#footnote-55" target="_self">55</a> Trademark lawyers in California advanced similar theories in a 2017 suit against LegalZoom&#8217;s trademark-filing product.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-56" href="#footnote-56" target="_self">56</a> And in 2024, a New Jersey plaintiff brought a class action alleging UPL violations.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-57" href="#footnote-57" target="_self">57</a></p><p>While AI&#8217;s legality remains in doubt, the threat of UPL liability can inhibit its adoption. Without reform, developers risk fines and criminal liability if their AI systems provide legal advice. Organizations may simply be unwilling to provide access to users, especially users who cannot pay enough to offset a developer&#8217;s risk of UPL liability. Separately, entity regulations that restrict financing for AI legal startups can deter the kinds of operational experimentation helpful for delivering legal services at lower costs.
Overall, if regulatory barriers prevent consumers from effectively accessing AI capabilities, capability advances will not translate into better legal outcomes for clients at lower costs.</p><p>That said, AI could reduce the costs of legal services for reasons unrelated to its ability to perform legal tasks. As the 2017 Clio survey mentioned above found, nonlegal work consumes a large percentage of lawyers&#8217; time at the low end of the market. If advanced AI helps find and communicate with clients, manage administrative tasks, and handle payments, it could free up these lawyers to spend more time on legal work.</p><h2><strong>Adversarial Dynamics</strong></h2><p>Even in a world in which AI increases lawyers&#8217; productivity and completes legal tasks, it might not lower the costs of legal services. To see why, it is crucial to distinguish <em>inputs</em> and <em>outputs</em> from <em>outcomes</em>. Inputs are what goes into legal work: employee talent, billable hours, and technological tools. Outputs are what legal work produces: contracts drafted, motions filed, and briefs written. Outcomes are what clients actually care about: disputes resolved, deals closed, and rights protected.</p><p>Consumers purchase legal services to achieve specific outcomes. Inputs and outputs can indirectly lead to those outcomes because more hours worked and more legal tasks help clients get the outcomes they want. But in a zero-sum context where the value of legal services is relative, if both sides increase their outputs, the advantages to either side of doing so can be limited.</p><p>Instead, AI might simply raise the inputs and outputs required to reach the same outcome, with productivity gains absorbed by greater production. The billable hours model, in which lawyers are paid and promoted based on inputs, only reinforces these dynamics: More hours worked drafting motions and reviewing documents translates into greater revenue for law firms without necessarily improving outcomes.</p><h3><strong>Litigation</strong></h3><p>One response is that these arms races actually create value by increasing the quantity and quality of legal work. However, clients sometimes achieve their desired outcomes (like settling a case or dismissing a lawsuit) by imposing greater costs on the other side instead of improving the quality of their legal arguments or evidence. It is a &#8220;core premise of litigation economics&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-58" href="#footnote-58" target="_self">58</a> that &#8220;all things being equal, the party facing higher costs will settle on terms more favorable to the party facing lower costs.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-59" href="#footnote-59" target="_self">59</a> And even where quality does improve, it&#8217;s uncertain whether the benefits of higher quality legal work (like helping courts reach the &#8220;right&#8221; answer more often) outweigh the costs of more legal work (like overwhelming judges with cases).</p><p>Earlier technological shifts cast doubt on whether America&#8217;s &#8220;adversarial legalism&#8221; can translate productivity gains into more affordable legal services.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-60" href="#footnote-60" target="_self">60</a></p><p>Discovery is a cornerstone of American litigation that often determines whether cases settle or go to trial.
In discovery, parties share information &#8220;to identify material facts that prove or disprove a claim.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-61" href="#footnote-61" target="_self">61</a> It operates through an adversarial exchange: One party sends a discovery request; the other searches its records and decides which documents are responsive and which are protected by privilege.</p><p>Discovery was conceived of as a cooperative process during which lawyers could share information to facilitate settlement and avoid trial.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-62" href="#footnote-62" target="_self">62</a> Yet two characteristics of discovery make it vulnerable to abuse. First, the party holding the documents (the reviewing party) knows what&#8217;s in them but seeks to share as little helpful information with the requesting party as possible to minimize legal risk. Second, the reviewing party, because it bears the responsibility and costs for producing documents, must essentially act as its adversary&#8217;s agent.</p><p>Each side can leverage these features to impose costs on the other. A requesting party, through excessive requests, can compel its adversary to review more documents for confidentiality and relevance. And a reviewing party, through excessive production, can bury relevant information in mountains of extraneous material, forcing the opposing side to spend more time on review. This can create an arms race of overrequesting and oversharing, as each side drives up costs for the other to pressure them to settle on more favorable terms. The billable hours model again reinforces this behavior, with more review generating more billable hours.</p><p>These adversarial incentives can be powerful. Judge Frank Easterbrook opened a well-known article, &#8220;Discovery as Abuse,&#8221; by analogizing the process to &#8220;nuclear war.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-63" href="#footnote-63" target="_self">63</a> Charles Yablon described how one side made life difficult for opposing counsel: It printed documents on dark red, foul-smelling paper so that their contents would be nearly illegible and the attorneys would become nauseated reviewing them.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-64" href="#footnote-64" target="_self">64</a> As Arthur Miller aptly observed, discovery&#8217;s key defect was believing &#8220;that adversarial tigers would behave like accommodating pussycats throughout the discovery period, saving their combative energies for trial.&#8221;</p><p>Digitization might have pushed discovery costs in either direction.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-65" href="#footnote-65" target="_self">65</a> Better search capabilities meant attorneys could review documents more efficiently, driving down costs while increasing the relevance of information shared.
Yet the explosion of digital information increased what parties might need to search through during discovery, creating more opportunities for discovery abuse.</p><p>The empirical evidence on discovery is limited, so making causal claims about digitization&#8217;s impact is difficult.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-66" href="#footnote-66" target="_self">66</a> The available evidence, however, suggests discovery remains expensive. It accounts for roughly &#8220;one-third to one-half of all litigation costs&#8221; when used.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-67" href="#footnote-67" target="_self">67</a> Fortune 200 companies reported that in cases with over $250,000 in legal fees, which are typically the kinds of complex litigation that require discovery, average litigation costs nearly doubled from $66 million per company in 2000 to $115 million in 2008.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-68" href="#footnote-68" target="_self">68</a> These figures align with evidence of overrequesting and oversharing: One trade association for civil defense lawyers estimated that for every page eventually shown at trial, meaning it&#8217;s relevant and reliable enough to be used as evidence, over one thousand pages were produced in discovery.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-69" href="#footnote-69" target="_self">69</a></p><p>The main lesson of digitization, then, is that adversarial processes do not, by default, translate productivity gains into lower-cost legal outcomes. David Engstrom and Jonah Gelbach hope that Technology Assisted Review (TAR) software, which uses predictive AI to classify documents for privilege and confidentiality based on an initial training set of labeled documents, can eventually solve discovery&#8217;s inefficiencies.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-70" href="#footnote-70" target="_self">70</a> Yet they acknowledge this is far from guaranteed given litigation&#8217;s adversarial structure. And because federal rules require discovery requests to be &#8220;proportional&#8221; to a case&#8217;s needs, judges might respond to declining unit costs of discovery (the cost of producing each document) by authorizing more <em>expansive</em> discovery plans, leaving total costs high. Though discovery is important, it is not uniquely susceptible to adversarial dynamics. An arms race of legal work in any stage of litigation (e.g., pretrial motions, expert battles, appeals) can erode efficiency gains and make achieving clients&#8217; objectives expensive.</p>
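<p>To make the TAR idea concrete, the sketch below shows the general technique in Python: a small seed set of attorney-labeled documents trains a text classifier, which then ranks the unreviewed corpus so human review can be prioritized. This is our simplified illustration of predictive coding in general, not Engstrom and Gelbach&#8217;s method or any vendor&#8217;s product; production TAR systems typically add iterative &#8220;active learning&#8221; rounds in which newly reviewed documents retrain the model. All document text and labels here are hypothetical.</p><pre><code class="language-python">
# Minimal sketch of TAR-style predictive coding (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical seed set labeled by an attorney: 1 = responsive, 0 = not.
seed_docs = [
    "Q3 pricing agreement with Acme, signed by the VP of Sales",
    "Lunch plans for the team offsite next Friday",
    "Draft indemnification clause for the Acme master services agreement",
    "Office printer is out of toner again",
]
seed_labels = [1, 0, 1, 0]

# Turn text into features and fit a simple classifier on the seed set.
vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(seed_docs), seed_labels)

# Score the unreviewed corpus; the highest-scoring documents go to human
# reviewers first, instead of reviewing everything in arbitrary order.
corpus = [
    "Acme contract renewal terms attached",
    "Happy birthday, Dana!",
]
scores = model.predict_proba(vectorizer.transform(corpus))[:, 1]
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.2f}  {doc}")
</code></pre>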
<h3><strong>Transactional Work</strong></h3><p>Similar adversarial patterns appear in transactional work like negotiating and drafting contracts. Hadfield illustrates how the value of transactional work can be relative using the example of a merger agreement negotiation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-71" href="#footnote-71" target="_self">71</a> In such negotiations, skillful lawyers can improve their client&#8217;s position by limiting disclosures, &#8220;skating the line between legitimate silence and misrepresentation.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-72" href="#footnote-72" target="_self">72</a> Opposing counsel will try to stay one step ahead by deciding what to ask, which representations to seek, and how to interpret disclosure laws to reduce the likelihood that material information is withheld.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-73" href="#footnote-73" target="_self">73</a> Lawyers will also try to &#8220;outfox&#8221; each other in the language of the contract itself.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-74" href="#footnote-74" target="_self">74</a></p><p>In the event of future litigation, the clients whose lawyers misunderstood a term&#8217;s significance pay the price. Since one can always do more legal research or add more contract provisions, there is no natural limit on the capacity for additional legal work to absorb efficiency gains. And because contracts concern future obligations, added uncertainty makes it harder for clients to effectively compare the quality and price of legal services.</p><p>Contracts have grown longer and more complex over time. From 1996 to 2016, M&amp;A agreements expanded from 35 to 88 single-spaced pages, their linguistic complexity increasing from post-graduate &#8220;grade 20&#8221; to postdoctoral &#8220;grade 30.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-75" href="#footnote-75" target="_self">75</a> An analysis of privacy policies over a similar period found a similar trend.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-76" href="#footnote-76" target="_self">76</a> John Coates argues that the increases in length and complexity reflect necessary and valuable responses to emerging legal risks.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-77" href="#footnote-77" target="_self">77</a> While longer, more complex contracts may represent better agreements, they might be necessary only in a legal system as adversarial as America&#8217;s, where litigation risks are high. Some scholars argue German contracts deliver similarly satisfactory legal outcomes with fewer words.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-78" href="#footnote-78" target="_self">78</a></p><p>That said, though transactional work has the adversarial elements described above, the overall structure is less adversarial than litigation. Because contract negotiations take place <em>before</em> a dispute occurs, there are more opportunities for transactional attorneys to add value beyond securing more of a fixed set of resources for their clients in a zero-sum negotiation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-79" href="#footnote-79" target="_self">79</a></p><h2><strong>Human Oversight</strong></h2><p>A third bottleneck is our desire for human involvement.
This is most relevant when AI gains outpace increases in production, which could happen for a few reasons. Perhaps there is an upper limit on arms races for certain kinds of legal work. After all, some legal doctrines are only so complicated, and courts often impose strict page limits on filings. Or maybe AI is so advanced that the costs of all legal tasks fall basically to zero and increased production does not absorb productivity gains. In this scenario, the new bottleneck for litigation would be the time required for judges to resolve cases, and for transactions, it would be the time parties need to understand a contract&#8217;s terms.</p><p>Starting with litigation: by reducing the cost of filing an initial lawsuit, AI will likely result in more disputes ending up in court. Within each dispute, AI can then create the kind of arms race of outputs described above. As a very rough but conservative estimate, Yonathan Arbel predicts a two- to fivefold increase in the volume of litigation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-80" href="#footnote-80" target="_self">80</a></p><p>Arbel outlines several ways that judges might respond to a flood of litigation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-81" href="#footnote-81" target="_self">81</a> They could limit the flow of litigation by altering procedural and substantive doctrines to make it harder for litigants to get into court (creating a bottleneck related to regulatory barriers). Or they could try to limit the use of AI in the courtroom, perhaps by requiring lawyers to disclose AI use or banning AI entirely and sanctioning any violators. Both responses would counteract a flood of litigation work but come at the steep cost of sacrificing access to justice for poorer litigants.</p><p>The debt collection context also provides an uninspiring picture of how courts have managed a flood of cases.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-82" href="#footnote-82" target="_self">82</a> As technology has enabled collectors to buy outstanding debt and cheaply file lawsuits for enforcement, the explosion of debt collection lawsuits has overwhelmed state courts.</p><p>Some courts have resorted to delegating cases to court assistants. Others operate &#8220;judgeless courtrooms&#8221; with lax evidentiary standards that can lower the quality of adjudication and undermine the rationale for the judicial process itself.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-83" href="#footnote-83" target="_self">83</a> If judges do not delegate, adjudicating cases will take longer as both the number of cases and the work each requires expand.
Yet as the common legal maxim says: &#8220;Justice delayed is justice denied.&#8221; Although more people might gain access to court, if judges require years to adjudicate cases, plaintiffs will face a choice between protracted litigation with no guarantee of success and settling cases on increasingly unfavorable terms as resolution times lengthen.</p><p>Another option is incorporating AI into the judicial process to ease the strain on overwhelmed courts.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-84" href="#footnote-84" target="_self">84</a> One early report suggests that AI is helping Brazil&#8217;s courts resolve cases more quickly.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-85" href="#footnote-85" target="_self">85</a> Yet if AI advances continuously reduce filing costs and drive up legal outputs, it will grow increasingly difficult for judges to keep pace. Perhaps AI will make judges more efficient. But there is a limit to how much AI can accelerate the process without meaningfully sacrificing human involvement.</p><p>Some seem open to replacing human judges entirely with AI.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-86" href="#footnote-86" target="_self">86</a> We find the legal (Article III, which establishes the federal courts, likely requires human judges),<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-87" href="#footnote-87" target="_self">87</a> technical (hallucination and private influence problems),<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-88" href="#footnote-88" target="_self">88</a> and moral objections persuasive.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-89" href="#footnote-89" target="_self">89</a> Even avowedly pro-AI lawyer Adam Unikowsky acknowledges he is &#8220;not quite ready to be ruled by robots.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-90" href="#footnote-90" target="_self">90</a> This is not to say that there is no role for AI in judging. But judges should adopt AI through careful, deliberate choices instead of in ways compelled by the need to keep up with an arms race of AI-powered legal work.</p><p>The argument for contracts is similar. If advanced AI reduces the cost of drafting contracts (perhaps it can instantly draft 50 perfect provisions), a contracting party, even with the help of AI, will still need time to understand what those provisions do and how they affect its future interests.</p><p>This bottleneck would not apply if people were to forgo oversight.
Arguably, human involvement matters less for contracts because many Americans already agree to contracts (like privacy policies) without reading them.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-91" href="#footnote-91" target="_self">91</a> But we think this reflects the belief that they have insufficient bargaining power to negotiate new terms, or that it&#8217;s not worthwhile to do so, rather than a general endorsement of signing contracts they don&#8217;t understand.</p><p>All this to say, we believe some measure of human involvement in the legal system is valuable and necessary, though this is a normative position and not an empirical claim.</p><h1><strong>Institutional Reforms</strong></h1><p>Many problems facing the legal industry are not new. For decades, legal academics and practitioners have suggested reforms targeting the bottlenecks described above. Some address AI specifically, others are more general, and a few are already being tested in various states and jurisdictions. In this section, we organize proposals into three categories, each corresponding to a bottleneck above: the regulation of legal services, the adjudication process, and the evolving role of human beings in legal work. While we aren&#8217;t tied to any particular recommendation, we discuss a range of reforms to illustrate that the future of the practice of law could look very different depending on how institutions respond (or decline to respond) to AI.</p><h2><strong>Reforming Professional Regulation</strong></h2><p>A chorus of voices has suggested reforms to the legal profession&#8217;s self-regulation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-92" href="#footnote-92" target="_self">92</a> Their recommendations range from clarifying existing UPL laws to modifying law firm ownership rules to overhauling how the profession itself is regulated.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-93" href="#footnote-93" target="_self">93</a> Some are already being tested.</p><h3><strong>Clarifying Unauthorized Practice of Law Rules</strong></h3><p>Current UPL laws define the practice of law imprecisely and may prohibit companies from offering AI-powered legal assistance to consumers. Some jurisdictions are expanding who may provide legal services, and academics have proposed updates with AI in mind.</p><p>Creating a new tier of legal service providers is one of the &#8220;fastest growing UPL reform program types&#8221; nationally, with seven states adopting this approach and another ten considering it.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-94" href="#footnote-94" target="_self">94</a> David Autor argues these reforms would benefit all professions by allowing people, in combination with AI, to work at levels of expertise previously unavailable to them.
He analogizes to the creation of the nurse practitioner role.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-95" href="#footnote-95" target="_self">95</a> In the early 1960s, nurses and doctors developed training programs and successfully lobbied the American Medical Association to create a new class of medical professionals who could perform tasks previously reserved for doctors.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-96" href="#footnote-96" target="_self">96</a> Other researchers offer a more concrete framework for how state courts might design this tier of legal service provider.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-97" href="#footnote-97" target="_self">97</a></p><p>Joseph Avery and co-authors propose more ambitious reforms: allowing nonlawyers, including AI systems, to offer many legal services.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-98" href="#footnote-98" target="_self">98</a> Bar associations would retain authority over who may use the designation of &#8220;lawyer,&#8221; but nonlawyers could provide any legal service other than representing clients in court.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-99" href="#footnote-99" target="_self">99</a><sup> </sup>This would allow companies and nonprofits to offer AI-enabled services to consumers without claiming they are licensed attorneys and without the threat of UPL litigation. The prospect of being sued for negligent work would still serve as a quality backstop for lawyers and nonlawyers alike.</p><p>Sean Steward, by contrast, takes no position on where to draw the line between acceptable and unacceptable uses of AI, instead emphasizing the need for clear, nationwide rules to reduce burdens on providers.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-100" href="#footnote-100" target="_self">100</a> Drew Simshaw likewise advocates a nationwide approach to eliminate the patchwork of vague, conflicting rules.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-101" href="#footnote-101" target="_self">101</a></p><p>To be clear, the problem with existing UPL rules is not that they are regulations and therefore stifle innovation. It is that their uncertainty and variation discourage competition from new entrants, including the kind that produces better legal services.</p><h3><strong>Alternative Business Structures</strong></h3><p>Legal scholars have long argued for updating &#8220;entity regulations&#8221; that prevent nonlawyers from sharing fees from or investing in law firms.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-102" href="#footnote-102" target="_self">102</a> Utah and Arizona recently created regulatory sandboxes to do exactly that.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-103" href="#footnote-103" target="_self">103</a> Utah&#8217;s sandbox allows entities to seek waivers from ownership restrictions and UPL rules, while Arizona eliminated restrictions on law firm ownership and fee-sharing. 
These sandboxes permit companies and nonprofits to operate under modified professional rules while regulators assess their impact on service quality, cost, and access to justice.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-104" href="#footnote-104" target="_self">104</a></p><p>The sandbox approach treats regulatory experimentation as necessary for balancing consumer protection and innovation. Overly stringent restrictions can backfire by protecting inefficient incumbents or forcing new entrants outside the law. Uber and Airbnb succeeded, in part, by accepting regulatory fines, scaling quickly, and becoming so ubiquitous that lawmakers had little choice but to legalize their conduct. Yet overly lax restrictions can undermine the regulations&#8217; purpose: protecting consumers from poor-quality services. Sandboxes allow policymakers to experiment and evaluate different regulatory approaches.</p><p>Early evidence on the impact of these reforms has been largely positive, though concerns have emerged regarding private equity ownership and mass tort litigation financing. Despite scant evidence of consumer harm&#8212;Utah&#8217;s Office of Legal Services Innovation received only twenty total complaints&#8212;lawyers and commerce groups petitioned the Arizona and Utah supreme courts to limit these sandboxes. The Arizona Supreme Court stayed the course, and authorized entities grew from 19 to 136 between 2022 and 2025.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-105" href="#footnote-105" target="_self">105</a> The Utah Supreme Court has since raised eligibility requirements, and authorized entities shrank from 39 to 11 over the same period.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-106" href="#footnote-106" target="_self">106</a></p><h3><strong>Regulatory Markets</strong></h3><p>Gillian Hadfield proposes a &#8220;superregulator&#8221; model that would create a market for the regulation of legal services.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-107" href="#footnote-107" target="_self">107</a> Rather than regulating providers directly, the government would license regulators that would each offer competing regulatory schemes. The government&#8217;s role shifts to &#8220;regulating the regulators&#8221; by setting outcome targets, such as acceptable levels of legal access or dispute resolution quality, and then licensing regulators that achieve them.</p><p>Hadfield argues this generates powerful incentives for innovation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-108" href="#footnote-108" target="_self">108</a> A private regulator that develops simpler, more cost-effective compliance methods while meeting government standards will attract more customers.
The model can also simplify enforcement: Governments can monitor ten licensed regulators more easily than thousands of individual providers.</p><p>We would add that regulatory markets may be able to assess AI&#8217;s utility for legal services more reliably than benchmarking.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-109" href="#footnote-109" target="_self">109</a> Two of us have emphasized that task-oriented benchmarks lack construct validity because they &#8220;overemphasize precisely the thing that language models are good at&#8221; while failing to test the contextual understanding and sustained reasoning that characterize consequential legal work.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-110" href="#footnote-110" target="_self">110</a> Benchmarks can also miss hidden costs that emerge only over time, such as deskilling of professionals.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-111" href="#footnote-111" target="_self">111</a> Relatedly, by targeting entry-level tasks, AI can disrupt the pipeline through which junior lawyers develop expertise.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-112" href="#footnote-112" target="_self">112</a></p><p>Hadfield defends the proposal&#8217;s practicality by drawing parallels to existing models. Governments already use outcomes-based regulations in environmental law, and private standard-setting bodies design many regulations currently in use. The United Kingdom provides one example. Under the Legal Services Act 2007, Parliament created the Legal Services Board (LSB), an independent agency that approves private bodies applying to regulate legal services.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-113" href="#footnote-113" target="_self">113</a> The system is not yet fully competitive because regulators came from preexisting trade associations for barristers and solicitors, which together regulate 90 percent of legal professionals in England and Wales.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-114" href="#footnote-114" target="_self">114</a> But competition is emerging for &#8220;alternative business structures,&#8221; which can choose between licensing from the Solicitors Regulation Authority (SRA) and the Bar Standards Board (BSB).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-115" href="#footnote-115" target="_self">115</a></p><p>This U.K. model has, however, encountered difficulties. In October 2024, the LSB criticized the SRA for failing to &#8220;act adequately, effectively, and efficiently&#8221; before the law firm Axiom Ince collapsed in October 2023.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-116" href="#footnote-116" target="_self">116</a> The LSB issued a report in March 2025 expressing &#8220;serious concerns&#8221; about the SRA&#8217;s effectiveness and then recently proposed sanctions.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-117" href="#footnote-117" target="_self">117</a> This highlights how a superregulator can struggle to enforce quality standards for the regulators it oversees. Some scholars have separately criticized regulatory markets in other contexts for creating a race to the bottom.
For example, Daniel Schwarcz cautions that a market for insurance regulation could &#8220;trigger a &#8216;race to the bottom&#8217; as regulators compete with each other to offer less and less intrusive regulatory schemes.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-118" href="#footnote-118" target="_self">118</a></p><h2><strong>Reforming Adjudication</strong></h2><p>Another set of reforms targets case adjudication. Some aim to make trials less adversarial, while others advocate for private adjudication like arbitration.</p><h3><strong>Judicial Case Management</strong></h3><p>Judges have some discretion over the litigation process and can exercise it to reduce adversarial dynamics. Several judges have recommended leveraging existing rules of evidence and civil procedure to manage cases more actively, taking inspiration from other jurisdictions (often called inquisitorial systems).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-119" href="#footnote-119" target="_self">119</a> Such targeted borrowing can help reduce competitive escalation.</p><p>One example is allowing courts to appoint their own expert witnesses. Under Federal Rule of Evidence 706, judges can appoint neutral experts who work for the court but are paid for by both parties.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-120" href="#footnote-120" target="_self">120</a> One state trial judge argues this can solve the &#8220;battle of experts&#8221; problem, in which competing specialists &#8220;abandon objectivity and become advocates for the side that hired them.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-121" href="#footnote-121" target="_self">121</a> When technical issues are central to a case, court-appointed experts can provide neutral assessments that frame issues more productively, avoiding an arms race of dueling expert reports.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-122" href="#footnote-122" target="_self">122</a></p><p>Another tool is the use of &#8220;special masters&#8221; under Federal Rule of Civil Procedure 53.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-123" href="#footnote-123" target="_self">123</a> These are neutral third parties appointed to help manage complex aspects of cases.
A federal judge and senior litigator explain that special masters can &#8220;assist and, when necessary, direct the parties&#8221; to complete discovery efficiently.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-124" href="#footnote-124" target="_self">124</a><sup> </sup>They note that the 2003 amendments expanded the scope of special masters&#8217; use to include pretrial matters &#8220;that cannot be effectively and timely addressed by an available district judge.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-125" href="#footnote-125" target="_self">125</a><sup> </sup>Rather than having parties fight over AI-assisted document review through successive motions, a special master could serve as an intermediary and prevent the technology from enabling larger discovery battles.</p><p>Judges currently have the discretion to intervene under these rules, but Congress could also pass legislation that makes their use mandatory.</p><h3><strong>Arbitration</strong></h3><p>While disputes are normally resolved through litigation, some contracts specify private arbitration.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-126" href="#footnote-126" target="_self">126</a> Contract drafters might prefer this process for several reasons: It can resolve disputes at lower costs,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-127" href="#footnote-127" target="_self">127</a> reduce class-action exposure,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-128" href="#footnote-128" target="_self">128</a> and prove &#8220;more flexible and less adversarial &#8230; than its judicial counterpart.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-129" href="#footnote-129" target="_self">129</a> Arbitration has become a popular alternative to traditional litigation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-130" href="#footnote-130" target="_self">130</a></p><p>Two legal scholars argue that because arbitration grants parties autonomy over the process, it&#8217;s the &#8220;ideal entry point for broader AI adoption in the legal field.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-131" href="#footnote-131" target="_self">131</a> They defend AI arbitration as consistent with the Federal Arbitration Act (FAA) and desirable for enhancing &#8220;efficiency, fairness, and flexibility of dispute resolution.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-132" href="#footnote-132" target="_self">132</a><sup> </sup>Another scholar takes the opposite stance, arguing that the FAA does not permit AI arbitration because robot adjudicators are inconsistent with the statute&#8217;s use of human pronouns like &#8220;he or they.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-133" href="#footnote-133" target="_self">133</a> And still another views it as undesirable because it would &#8220;significantly diminish the long-standing reputation&#8221; of arbitration.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-134" href="#footnote-134" target="_self">134</a></p><p>Some observers critique arbitration as unfair because companies often force it on consumers and employees
through &#8220;take-it-or-leave-it&#8221; contracts of adhesion.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-135" href="#footnote-135" target="_self">135</a> These fairness concerns are important, and others have written about the level of consent needed for arbitration to be truly fair.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-136" href="#footnote-136" target="_self">136</a> But assuming the decision to enter arbitration reflects the free choice of both parties, offering AI arbitration as a parallel track for resolving cases can have several advantages.</p><p>First, it can promote choice by allowing litigants to decide between traditional judicial review and AI-assisted adjudication.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-137" href="#footnote-137" target="_self">137</a> A consumer defending against an automated debt collection suit might prefer quick AI resolution over years of waiting, while a defendant facing serious consequences might insist on traditional review by human judges.</p><p>Second, should an arms race of legal outputs risk overwhelming the courts, the availability of a technology-mediated alternative can alleviate that pressure, preserving human review for the contexts where those navigating the judicial system feel they need it most.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-138" href="#footnote-138" target="_self">138</a></p><p>Third, it creates a natural experiment that facilitates comparison between human and AI adjudicators on dimensions like speed, cost, and participant satisfaction.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-139" href="#footnote-139" target="_self">139</a> This generates evidence about AI&#8217;s actual performance, reducing reliance on speculation or vendor claims, and it pressures traditional institutions to improve or risk being outcompeted.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-140" href="#footnote-140" target="_self">140</a></p><h2><strong>The Evolving Role of Lawyers</strong></h2><p>A final set of reforms discusses the evolving role of lawyers in both litigation and transactional contexts. With litigation, legislatures could expand the judiciary to alleviate the bottleneck created by the time human judges take to resolve cases. With transactions, there isn&#8217;t a clear action item for companies, but we expect to see a shift in what in-house counsel do. As the bottleneck becomes the time it takes human lawyers to understand complex contracts, in-house lawyers will likely spend more time understanding a company&#8217;s needs and making strategic judgments, and less time on legal tasks.</p><h3><strong>Expanding the Judiciary</strong></h3><p>The most straightforward response to an overburdened judiciary is to increase its capacity by hiring more judges.
Legal scholars have advocated for this solution for decades.</p><p>For example, writing in 1979, Maria Marcus argued that &#8220;[s]ince the factors that channel disputes into a judicial forum continue unabated, the appointment of more judges is an obvious response.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-141" href="#footnote-141" target="_self">141</a> Bert Huang likewise recommended in 2011 that &#8220;new demands put on the courts should be met quickly and flexibly with new judicial resources.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-142" href="#footnote-142" target="_self">142</a> More recently, Peter Menell and Ryan Vacca endorsed an observation from decades earlier that &#8220;the increase in the order of magnitude of the demands our society imposes on the federal judicial system&#8221; should encourage Congress to act.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-143" href="#footnote-143" target="_self">143</a></p><p>Menell and Vacca acknowledge that &#8220;[i]ncreasing the number of federal judgeships has been fraught with political complications.&#8221; Their solution is a bipartisan &#8220;2030 Commission&#8221; to depoliticize the process.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-144" href="#footnote-144" target="_self">144</a> Arbel similarly views adding judges as &#8220;the most direct way of solving the problem&#8221; of an AI-driven increase in litigation work, yet stops short of recommending it because it &#8220;appears quite tenuous in our current political reality.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-145" href="#footnote-145" target="_self">145</a> He notes that redirecting all civil legal aid funding (approximately $2.7 billion) toward the $9.4 billion federal court system would yield at most a 30 percent increase in judicial capacity, falling short of the doubling likely needed to handle the increased caseload.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-146" href="#footnote-146" target="_self">146</a></p><p>But civil legal aid is not the only potential funding source. Some states have proposed taxing legal services generally,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-147" href="#footnote-147" target="_self">147</a> while federal legislation introduced by Sen. Thom Tillis (R-N.C.) would tax third-party litigation financiers who fund plaintiffs&#8217; legal fees in exchange for a percentage of eventual winnings.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-148" href="#footnote-148" target="_self">148</a><sup> </sup>Although these bills aim to increase overall tax revenues, similar measures could earmark funds specifically for the judiciary.
Deborah Rhode has proposed another approach: mandatory pro bono service requirements for all attorneys, with the option to &#8220;buy out&#8221; their obligation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-149" href="#footnote-149" target="_self">149</a> Though initially conceived as a way to provide access to justice, these payments could also support judicial expansion.</p><p>So while we agree that expanding the judiciary faces real political obstacles, we don&#8217;t think it&#8217;s as unrealistic as Arbel fears, especially considering the magnitude of AI&#8217;s potential disruption.</p><h3><strong>In-House Counsel as Strategic Advisors</strong></h3><p>Our analysis suggests that among in-house lawyers, value will likely shift from completing tasks to predicting how contracts and agreements impact the overall business. Lawyers will need to deeply understand their organizations and exercise business judgment.</p><p>Industry reports agree.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-150" href="#footnote-150" target="_self">150</a> In a 2023 survey of nearly three hundred chief legal officers, 87 percent said their role is shifting from legal risk mitigator to strategic business partner.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-151" href="#footnote-151" target="_self">151</a> Thomson Reuters&#8217;s 2024 &#8220;Future of Professionals Report&#8221; found that 42 percent of legal professionals expect to spend more time on judgment-based legal work in the next five years, as AI handles more routine tasks.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-152" href="#footnote-152" target="_self">152</a> As one attorney respondent put it: &#8220;The role of a good lawyer is as a &#8216;trusted advisor,&#8217; not as a producer of documents &#8230; breadth of experience is where a lawyer&#8217;s true value lies and that will remain valuable.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-153" href="#footnote-153" target="_self">153</a></p><p>To understand the shifting role of legal professionals, it is helpful to consider the hierarchy of legal work and roles. 
At the bottom is basic &#8220;low-skilled&#8221; legal work like drafting standard letters or simple contracts, repetitive tasks requiring minimal legal expertise.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-154" href="#footnote-154" target="_self">154</a> Above that is medium-skill, noncommoditized legal work that involves producing documents, such as analyzing contracts and drafting motions.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-155" href="#footnote-155" target="_self">155</a><sup> </sup>Higher still is &#8220;judgment-based legal work&#8221;: overseeing complex trials and addressing legal risks.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-156" href="#footnote-156" target="_self">156</a> At the top is strategic advising, where lawyers deeply understand an organization&#8217;s priorities and shape its decisions.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-157" href="#footnote-157" target="_self">157</a> This highest level might not involve what we traditionally consider legal work at all.</p><p>As AI pushes the role of human expertise up this hierarchy, the legal profession should rethink how it trains lawyers. While AI automates the tasks at the bottom of the hierarchy, demand will likely grow for lawyers at the top who can translate legal information into strategic advice. One bar association warned that the &#8220;greater concern is that generative AI will displace younger attorneys,&#8221; who will &#8220;have fewer opportunities to gain valuable experience by spending hours on important tasks.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-158" href="#footnote-158" target="_self">158</a> As the skills required to succeed change, so too should the training process.</p><h1><strong>Conclusion</strong></h1><p>Many problems facing the legal industry do not require revolutionary insights to solve. Scholars and practitioners have long emphasized the need for regulatory reform and changes to the litigation process, as well as the disconnect between legal work and client outcomes. AI, rather than solving these problems, appears to be revealing and magnifying them. Without addressing these underlying issues, AI alone is unlikely to improve the outcomes clients care about.</p><p>But AI may also present new opportunities for reform. Sociologists have argued that fields facing crises are more receptive to efforts to reshape institutions in those areas.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-159" href="#footnote-159" target="_self">159</a> Such crises, jolts, shocks, and disruptive events&#8212;taking the form of social upheaval, technological disruption, or other changes&#8212;can reveal problems and contradictions that require solutions. The widespread predictions that AI will transform the practice of law may constitute such a crisis, creating pressure for legal institutions to respond.
The key question is whether they can use this opportunity to enact the reforms the industry has needed for decades and produce better outcomes for clients.</p><p><em>Acknowledgments:</em> We are grateful to Samar Ahmad, James Bedford, Jasper Boers, Lev Cohen, Joe Goode, Mihir Kshirsagar, Martha Minow, Ben Press, and Jonathan Zittrain for helpful feedback on this report.</p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Jan Philipp Burgard, &#8220;Sam Altman Predicts AI Will Surpass Human Intelligence by 2030,&#8221; <em>Business Insider</em> (Sept. 26, 2025).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Dario Amodei, &#8220;Machines of Loving Grace: How AI Could Transform the World for the Better&#8221; (October 2024), https://perma.cc/8CPH-TJ63.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>See, e.g., Edward W. Felten, Manav Raj, &amp; Robert Seamans, &#8220;Occupational Heterogeneity in Exposure to Generative AI&#8221; (April 10, 2023) (unpublished manuscript), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4414065.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>See Daniel Martin Katz, Michael James Bommarito, Shang Gao, &amp; Pablo Arredondo, &#8220;GPT-4 Passes the Bar Exam,&#8221; 382 <em>Philosophical Transactions of the Royal Society</em> (2024), https://doi.org/10.1098/rsta.2023.0254.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Karen Sloan &amp; Sara Merken, &#8220;AI Training Becomes Mandatory at More US Law Schools,&#8221; <em>Reuters</em> (Sept. 22, 2025); University of San Francisco, &#8220;The University of San Francisco School of Law Embeds GenAI Into Core Curriculum&#8221; (April 16, 2025), https://perma.cc/2RSV-ZZGA.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Harvey, &#8220;Harvey Launches Academic Program With Law Schools at Stanford, NYU, Michigan, UCLA, The University of Texas, and Notre Dame&#8221; (Aug. 28, 2025), https://perma.cc/77Q3-C5P6.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Adam Unikowsky, &#8220;Should AI Replace Law Clerks?,&#8221; <em>Adam&#8217;s Legal Newsletter</em> (Jan. 
20, 2023), https://perma.cc/A7WN-4T9A; Adam Unikowsky, &#8220;Automating Oral Argument,&#8221; <em>Adam&#8217;s Legal Newsletter</em> (July 7, 2025), https://perma.cc/6FGM-27XS.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Richard Susskind, &#8220;Artificial Intelligence Could Replace Traditional Lawyers by 2035,&#8221; <em>Times (London)</em> (March 2025).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>See David Freeman Engstrom, Lucy Ricca, &amp; Natalie Knowlton, &#8220;Regulatory Innovation at the Crossroads: Five Years of Data on Entity-Regulation Reform in Arizona and Utah,&#8221; <em>Stanford Law School</em> (June 2, 2025), https://perma.cc/QL5D-8EW6.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Press Release, LexisNexis, &#8220;LexisNexis CounselLink Releases 2025 Trends Report Showing Large Law Command of Partner Rates&#8221; (April 22, 2025), https://perma.cc/XM8C-VTZJ.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>E.g., &#8220;How to Harness AI for Justice,&#8221; 108 <em>Judicature</em> 42 (2024); OECD, &#8220;Governing With Artificial Intelligence: The State of Play and Way Forward in Core Government Functions&#8221; (Sept. 18, 2025), https://doi.org/10.1787/795de142-en; Shana Lynch, &#8220;Harnessing AI to Improve Access to Justice in Civil Courts,&#8221; <em>Stanford HAI</em> (March 4, 2025), https://perma.cc/6PGP-UG2J; Robert J. Couture, &#8220;The Impact of Artificial Intelligence on Law Firms&#8217; Business Models,&#8221; <em>Harvard Law School Center on</em> <em>the Legal Profession</em> (Feb. 
24, 2025), https://perma.cc/HYP6-SJQB.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Our understanding of bottlenecks is informed by systems analysis and theories of constraints in management science.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>See Ed Walters, &#8220;Re-Regulating UPL in an Age of AI,&#8221; 8 <em>Georgetown Law Technology Review</em> 316 (2024).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Brian Oten, &#8220;Artificial Intelligence, Real Practice,&#8221; 28 <em>North Carolina State Bar Journal</em> 3 (fall 2023).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>This report focuses on the U.S. legal system, but for more on how U.S. adversarialism differs from other legal systems, Leo You Li has compared the U.S. system with China and the U.K. and discussed what they might learn from each other. See, generally, Leo You Li, &#8220;Digitization, Adversarial Legalism, and Access to Justice Reforms,&#8221; 76 <em>South Carolina Law Review</em> 883 (2025).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Robert A. Kagan, Adversarial Legalism: The American Way of Law (2d ed., 2018).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>Gillian K. Hadfield, &#8220;The Price of Law: How the Market for Lawyers Distorts the Justice System,&#8221; 98 <em>Michigan Law Review</em> 953 (2000), https://dx.doi.org/10.2139/ssrn.191908.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>David Freeman Engstrom &amp; Jonah B. Gelbach, &#8220;Legal Tech, Civil Procedure, and the Future of Adversarialism,&#8221; 169 <em>University of Pennsylvania Law Review</em> 1001 (2021).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>David Freeman Engstrom &amp; Nora Freeman Engstrom, &#8220;Legal Tech and the Litigation Playing Field,&#8221; in <em>Legal Tech and the Future of Civil Justice</em> 133 (David F. 
Engstrom, ed., 2023), https://doi.org/10.1017/9781009255301.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p>Hadfield, &#8220;The Price of Law,&#8221; supra note 17.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><p>Li, supra note 15, at 892&#8211;93.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-22" href="#footnote-anchor-22" class="footnote-number" contenteditable="false" target="_self">22</a><div class="footnote-content"><p>Arvind Narayanan &amp; Sayash Kapoor, &#8220;AI as Normal Technology,&#8221; <em>AI as Normal Technology</em> (April 15, 2025), https://perma.cc/AQ2G-CQP3.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-23" href="#footnote-anchor-23" class="footnote-number" contenteditable="false" target="_self">23</a><div class="footnote-content"><p>Justin Curl, &#8220;AI Is Just Starting to Change the Legal Profession,&#8221; <em>Understanding AI</em> (Jan. 15, 2026), https://perma.cc/N3BQ-5MKV.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-24" href="#footnote-anchor-24" class="footnote-number" contenteditable="false" target="_self">24</a><div class="footnote-content"><p>Gillian Hadfield has written about the high costs of legal services for decades, and this section draws heavily on her work. E.g., Hadfield, &#8220;The Price of Law,&#8221; supra note 17; Gillian K. Hadfield, &#8220;The Cost of Law: Promoting Access to Justice Through the (Un)Corporate Practice of Law,&#8221; 38 <em>International Review of Law and Economics</em> 43 (2014), https://doi.org/10.1016/j.irle.2013.09.003; Gillian K. Hadfield &amp; Deborah L. Rhode, &#8220;How to Regulate Legal Services to Promote Access, Innovation, and the Quality of Lawyering,&#8221; 67 <em>Hastings Law Journal</em> 1191 (2016); Gillian K. Hadfield, &#8220;More Markets, More Justice,&#8221; 148 <em>D&#230;dalus</em> 37 (winter 2019), https://doi.org/10.1162/daed_a_00533; Gillian K. Hadfield, &#8220;Legal Markets,&#8221; 60 <em>Journal of Economic Literature</em> 1264 (2022), https://doi.org/10.1257/jel.20201330.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-25" href="#footnote-anchor-25" class="footnote-number" contenteditable="false" target="_self">25</a><div class="footnote-content"><p>Hadfield, &#8220;The Price of Law,&#8221; supra note 17, at 969.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-26" href="#footnote-anchor-26" class="footnote-number" contenteditable="false" target="_self">26</a><div class="footnote-content"><p>Id. 
at 972&#8211;73.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-27" href="#footnote-anchor-27" class="footnote-number" contenteditable="false" target="_self">27</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-28" href="#footnote-anchor-28" class="footnote-number" contenteditable="false" target="_self">28</a><div class="footnote-content"><p>Engstrom &amp; Gelbach, supra note 18, at 1062.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-29" href="#footnote-anchor-29" class="footnote-number" contenteditable="false" target="_self">29</a><div class="footnote-content"><p>Hadfield, &#8220;The Price of Law,&#8221; supra note 17, at 973.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-30" href="#footnote-anchor-30" class="footnote-number" contenteditable="false" target="_self">30</a><div class="footnote-content"><p>John Armour &amp; Mari Sako, &#8220;Lawtech: Leveling the Playing Field in Legal Services?&#8221; in <em>Legal Tech and the Future of Civil Justice</em> 44, 44 (David F. Engstrom, ed., 2023), https://dx.doi.org/10.2139/ssrn.3831481.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-31" href="#footnote-anchor-31" class="footnote-number" contenteditable="false" target="_self">31</a><div class="footnote-content"><p><em>LexisNexis CounselLink</em>, &#8220;2025 Trends Report&#8221; (2025), https://perma.cc/Q6K3-9SCW.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-32" href="#footnote-anchor-32" class="footnote-number" contenteditable="false" target="_self">32</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-33" href="#footnote-anchor-33" class="footnote-number" contenteditable="false" target="_self">33</a><div class="footnote-content"><p>Hadfield, &#8220;Legal Markets,&#8221; supra note 24, at 1288.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-34" href="#footnote-anchor-34" class="footnote-number" contenteditable="false" target="_self">34</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-35" href="#footnote-anchor-35" class="footnote-number" contenteditable="false" target="_self">35</a><div class="footnote-content"><p>Engstrom, Ricca, &amp; Knowlton, supra note 9, at 5.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-36" href="#footnote-anchor-36" class="footnote-number" contenteditable="false" target="_self">36</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-37" href="#footnote-anchor-37" class="footnote-number" contenteditable="false" target="_self">37</a><div class="footnote-content"><p>Hadfield, &#8220;More Markets, More Justice,&#8221; supra note 24.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-38" href="#footnote-anchor-38" class="footnote-number" contenteditable="false" target="_self">38</a><div class="footnote-content"><p>Armour &amp; Sako, supra note 30, at 44.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-39" href="#footnote-anchor-39" class="footnote-number" contenteditable="false" 
target="_self">39</a><div class="footnote-content"><p>Hadfield, &#8220;More Markets, More Justice,&#8221; supra note 24, citing Clio, &#8220;2017 Legal Trends Report,&#8221; https://perma.cc/YS73-47QL.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-40" href="#footnote-anchor-40" class="footnote-number" contenteditable="false" target="_self">40</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-41" href="#footnote-anchor-41" class="footnote-number" contenteditable="false" target="_self">41</a><div class="footnote-content"><p>Pew Charitable Trusts, &#8220;How Debt Collectors Are Transforming the Business of State Courts,&#8221; at 8 (May 6, 2020), https://perma.cc/999L-LQKA.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-42" href="#footnote-anchor-42" class="footnote-number" contenteditable="false" target="_self">42</a><div class="footnote-content"><p>Michigan Justice for All Commission, Debt Collection Work Group, &#8220;Advancing Justice for All in Debt Collection Lawsuits: Report &amp; Recommendations,&#8221; https://perma.cc/2SQT-M9SB (last visited Jan. 26, 2026).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-43" href="#footnote-anchor-43" class="footnote-number" contenteditable="false" target="_self">43</a><div class="footnote-content"><p>Pew Charitable Trusts, supra note 41, at 2.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-44" href="#footnote-anchor-44" class="footnote-number" contenteditable="false" target="_self">44</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-45" href="#footnote-anchor-45" class="footnote-number" contenteditable="false" target="_self">45</a><div class="footnote-content"><p>Institute for Justice, &#8220;Right to Provide Legal Advice,&#8221; https://perma.cc/Q2FN-8WEE (last visited Jan. 26, 2026).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-46" href="#footnote-anchor-46" class="footnote-number" contenteditable="false" target="_self">46</a><div class="footnote-content"><p><em>Upsolve, Inc. v. James</em>, 604 F.Supp.3d 97 (S.D.N.Y. May 24, 2022).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-47" href="#footnote-anchor-47" class="footnote-number" contenteditable="false" target="_self">47</a><div class="footnote-content"><p><em>Upsolve, Inc. v. James</em>, 155 F.4th 133 (2d Cir. Sept. 9, 2025).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-48" href="#footnote-anchor-48" class="footnote-number" contenteditable="false" target="_self">48</a><div class="footnote-content"><p>Brian Oten, &#8220;Artificial Intelligence, Real Practice,&#8221; 28 <em>North Carolina State Bar Journal</em> 3 (fall 2023); see also Maria E. 
Berkenkotter &amp; Linos Lipinsky de Orlov, &#8220;Can Robot Lawyers Close the Access to Justice Gap?&#8221; <em>Colorado Lawyer</em> (December 2024), at 40, https://perma.cc/YY73-XJDE.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-49" href="#footnote-anchor-49" class="footnote-number" contenteditable="false" target="_self">49</a><div class="footnote-content"><p>New York State Bar Association Task Force on Artificial Intelligence, &#8220;Report and Recommendations to NYSBA House of Delegates&#8221; (April 6, 2024), https://perma.cc/6CXP-NLCA.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-50" href="#footnote-anchor-50" class="footnote-number" contenteditable="false" target="_self">50</a><div class="footnote-content"><p>Sean Steward, &#8220;Are AI Lawyers a Legal Product or Legal Service?: Why Current UPL Laws Are Not Up to the Task of Regulating Autonomous AI Actors,&#8221; 53 <em>Hofstra Law Review</em> 391 (2025).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-51" href="#footnote-anchor-51" class="footnote-number" contenteditable="false" target="_self">51</a><div class="footnote-content"><p>Walters, supra note 13, at 332.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-52" href="#footnote-anchor-52" class="footnote-number" contenteditable="false" target="_self">52</a><div class="footnote-content"><p>Laurel A. Rigertas, &#8220;The Legal Profession&#8217;s Monopoly: Failing to Protect Consumers,&#8221; 82 <em>Georgetown Journal of Legal Ethics</em> 1085 (2019).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-53" href="#footnote-anchor-53" class="footnote-number" contenteditable="false" target="_self">53</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-54" href="#footnote-anchor-54" class="footnote-number" contenteditable="false" target="_self">54</a><div class="footnote-content"><p><em>Janson v. LegalZoom.com, Inc.,</em> 802 F. Supp. 2d 1053 (W.D. Mo. 2011).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-55" href="#footnote-anchor-55" class="footnote-number" contenteditable="false" target="_self">55</a><div class="footnote-content"><p><em>LegalZoom.com, Inc. v. N.C. State Bar,</em> No. 11 CVS 15111, 2015 NCBC 96 (N.C. Super. Ct. Oct. 22, 2015), https://perma.cc/U3LS-3C83.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-56" href="#footnote-anchor-56" class="footnote-number" contenteditable="false" target="_self">56</a><div class="footnote-content"><p>Jason Tashea, &#8220;Nonlawyers at LegalZoom Performed Legal Work on Trademark Applications, UPL Suit Alleges,&#8221; <em>ABA Journal</em> (Dec. 20, 2017).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-57" href="#footnote-anchor-57" class="footnote-number" contenteditable="false" target="_self">57</a><div class="footnote-content"><p>Class Action Complaint, <em>Erasmus v. LegalZoom.com, Inc.,</em> No. ESX-L (N.J. Super. Ct. Law Div. 
Essex Cnty.), https://perma.cc/CR2L-ZV89.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-58" href="#footnote-anchor-58" class="footnote-number" contenteditable="false" target="_self">58</a><div class="footnote-content"><p>Engstrom &amp; Engstrom, supra note 19.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-59" href="#footnote-anchor-59" class="footnote-number" contenteditable="false" target="_self">59</a><div class="footnote-content"><p>J. Maria Glover, &#8220;The Federal Rules of Civil Settlement,&#8221; 87 <em>New York University Law Review</em> 1713 (2012).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-60" href="#footnote-anchor-60" class="footnote-number" contenteditable="false" target="_self">60</a><div class="footnote-content"><p>Kagan, supra note 16.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-61" href="#footnote-anchor-61" class="footnote-number" contenteditable="false" target="_self">61</a><div class="footnote-content"><p>Engstrom &amp; Gelbach, supra note 18, at 1043.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-62" href="#footnote-anchor-62" class="footnote-number" contenteditable="false" target="_self">62</a><div class="footnote-content"><p>John S. Beckerman, &#8220;Confronting Civil Discovery&#8217;s Fatal Flaws,&#8221; 84 <em>Minnesota Law Review</em> 505 (2000), https://dx.doi.org/10.2139/ssrn.199068.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-63" href="#footnote-anchor-63" class="footnote-number" contenteditable="false" target="_self">63</a><div class="footnote-content"><p>Frank H. Easterbrook, &#8220;Discovery as Abuse,&#8221; 69 <em>Boston University Law Review</em> 635 (1989).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-64" href="#footnote-anchor-64" class="footnote-number" contenteditable="false" target="_self">64</a><div class="footnote-content"><p>Charles M. Yablon, &#8220;Stupid Lawyer Tricks: An Essay on Discovery Abuse,&#8221; 96 <em>Columbia Law Review</em> 1618 (1996).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-65" href="#footnote-anchor-65" class="footnote-number" contenteditable="false" target="_self">65</a><div class="footnote-content"><p>Engstrom &amp; Gelbach, supra note 18.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-66" href="#footnote-anchor-66" class="footnote-number" contenteditable="false" target="_self">66</a><div class="footnote-content"><p>Alexandra Lahav emphasized the shortage of reliable evidence in an article calling for courts to log discovery requests, so researchers can better assess the extent of discovery abuse and costs. While she suggests discovery is not as big a problem as lawyers claim, her main point is that there is insufficient evidence to be confident either way. Even she agrees that where discovery is actively employed it is very expensive, though she views these costs as justified by the higher dollar amounts at stake. See Alexandra D.
Lahav, &#8220;A Proposal to End Discovery Abuse,&#8221; 71 <em>Vanderbilt Law Review</em> 2037 (2019).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-67" href="#footnote-anchor-67" class="footnote-number" contenteditable="false" target="_self">67</a><div class="footnote-content"><p>Engstrom &amp; Engstrom, supra note 19, at 138.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-68" href="#footnote-anchor-68" class="footnote-number" contenteditable="false" target="_self">68</a><div class="footnote-content"><p>Hadfield, &#8220;Legal Markets,&#8221; supra note 24.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-69" href="#footnote-anchor-69" class="footnote-number" contenteditable="false" target="_self">69</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-70" href="#footnote-anchor-70" class="footnote-number" contenteditable="false" target="_self">70</a><div class="footnote-content"><p>Engstrom &amp; Gelbach, supra note 18, at 1052&#8211;53.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-71" href="#footnote-anchor-71" class="footnote-number" contenteditable="false" target="_self">71</a><div class="footnote-content"><p>Hadfield, &#8220;The Price of Law,&#8221; supra note 17.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-72" href="#footnote-anchor-72" class="footnote-number" contenteditable="false" target="_self">72</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-73" href="#footnote-anchor-73" class="footnote-number" contenteditable="false" target="_self">73</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-74" href="#footnote-anchor-74" class="footnote-number" contenteditable="false" target="_self">74</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-75" href="#footnote-anchor-75" class="footnote-number" contenteditable="false" target="_self">75</a><div class="footnote-content"><p>John C. Coates IV, &#8220;Why Have M&amp;A Contracts Grown? Evidence From Twenty Years of Deals&#8221; (Harvard Law School, Working Paper, Oct. 26, 2016), https://perma.cc/JQ8X-R6KV.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-76" href="#footnote-anchor-76" class="footnote-number" contenteditable="false" target="_self">76</a><div class="footnote-content"><p>Isabel Wagner, &#8220;Privacy Policies Across the Ages: Content of Privacy Policies 1996&#8211;2021,&#8221; 26 <em>ACM Transactions on Privacy and Security</em>, Article 32, at 1 (2023), https://doi.org/10.1145/3590152.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-77" href="#footnote-anchor-77" class="footnote-number" contenteditable="false" target="_self">77</a><div class="footnote-content"><p>Coates IV, supra note 75.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-78" href="#footnote-anchor-78" class="footnote-number" contenteditable="false" target="_self">78</a><div class="footnote-content"><p>Claire A. 
Hill &amp; Christopher King, &#8220;How Do German Contracts Do as Much With Fewer Words?&#8221; 79 <em>Chicago-Kent Law Review</em> 889 (2004).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-79" href="#footnote-anchor-79" class="footnote-number" contenteditable="false" target="_self">79</a><div class="footnote-content"><p>See, e.g., Ronald J. Gilson, &#8220;Value Creation by Business Lawyers: Legal Skills and Asset Pricing,&#8221; 94 <em>Yale Law Journal</em> 239 (1984); Steven L. Schwarcz, &#8220;Explaining the Value of Transactional Lawyering,&#8221; 12 <em>Stanford Journal of Law, Business &amp; Finance</em> 486 (2007).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-80" href="#footnote-anchor-80" class="footnote-number" contenteditable="false" target="_self">80</a><div class="footnote-content"><p>Yonathan A. Arbel, &#8220;Judicial Economy in the Age of AI,&#8221; 96 <em>Colorado Law Review</em> 549 (2025), https://dx.doi.org/10.2139/ssrn.4873649.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-81" href="#footnote-anchor-81" class="footnote-number" contenteditable="false" target="_self">81</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-82" href="#footnote-anchor-82" class="footnote-number" contenteditable="false" target="_self">82</a><div class="footnote-content"><p>Li, supra note 15.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-83" href="#footnote-anchor-83" class="footnote-number" contenteditable="false" target="_self">83</a><div class="footnote-content"><p>Human Rights Watch, &#8220;Rubber Stamp Justice&#8221; (January 2016), https://perma.cc/JV3T-72ZM.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-84" href="#footnote-anchor-84" class="footnote-number" contenteditable="false" target="_self">84</a><div class="footnote-content"><p>Arbel, supra note 80.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-85" href="#footnote-anchor-85" class="footnote-number" contenteditable="false" target="_self">85</a><div class="footnote-content"><p>Pedro Nakamura, &#8220;AI Is Helping Judges to Quickly Close Cases, and Lawyers to Quickly Open Them,&#8221; <em>Rest of World</em> (Sept. 25, 2025), https://perma.cc/GF8G-SH3J.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-86" href="#footnote-anchor-86" class="footnote-number" contenteditable="false" target="_self">86</a><div class="footnote-content"><p>See, e.g., Victor Tangermann, &#8220;Estonia Is Building a &#8216;Robot Judge&#8217; to Help Clear Legal Backlog,&#8221; <em>Futurism</em> (March 25, 2019), https://perma.cc/GC2W-BLT5.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-87" href="#footnote-anchor-87" class="footnote-number" contenteditable="false" target="_self">87</a><div class="footnote-content"><p>Jerry M.
Gewirtz, &#8220;Artificial Intelligence May Assist, but Can Never Replace, the Judicial Decision-Making Process of Human Judges,&#8221; 98 <em>Florida Bar Journal</em> 6, 8 (November/December 2024), https://perma.cc/WT33-LYBL.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-88" href="#footnote-anchor-88" class="footnote-number" contenteditable="false" target="_self">88</a><div class="footnote-content"><p>Justin Curl, Peter Henderson, Kart Kandula, &amp; Faiz Surani, &#8220;Judges Shouldn&#8217;t Rely on AI for the Ordinary Meaning of Text,&#8221; <em>Lawfare</em> (May 22, 2025).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-89" href="#footnote-anchor-89" class="footnote-number" contenteditable="false" target="_self">89</a><div class="footnote-content"><p>Marcin G&#243;rski, &#8220;Why a Human Court?,&#8221; 18 <em>EUCrim</em> 83 (2023), https://perma.cc/P6W6-RXX9.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-90" href="#footnote-anchor-90" class="footnote-number" contenteditable="false" target="_self">90</a><div class="footnote-content"><p>Unikowsky, &#8220;Should AI Replace Law Clerks?,&#8221; supra note 7.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-91" href="#footnote-anchor-91" class="footnote-number" contenteditable="false" target="_self">91</a><div class="footnote-content"><p>Brooke Auxier, Lee Rainie, Monica Anderson, Andrew Perrin, Madhu Kumar, &amp; Erica Turner, &#8220;Americans and Privacy: Concerned, Confused and Feeling Lack of Control Over Their Personal Information,&#8221; <em>Pew Research Center</em> (Nov. 15, 2019), https://perma.cc/Z5XZ-QVBT.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-92" href="#footnote-anchor-92" class="footnote-number" contenteditable="false" target="_self">92</a><div class="footnote-content"><p>See, e.g., Hon. C. S. Maravilla, &#8220;A(I)ccess to Justice: How AI and Ethics Opinions Approving Limited Scope Representation Support Legal Market Consolidation,&#8221; 40 <em>Georgia State University Law Review</em> 957 (2024); Bruce A. Green &amp; M. Ellen Murphy, &#8220;Replacing This Old House: Certifying and Regulating New Legal Services Providers,&#8221; 76 <em>Washington University Journal of Law and Policy</em> 45 (2025); Joseph J. Avery, Patricia S&#225;nchez Abril, &amp; Alissa del Riego, &#8220;ChatGPT, Esq.: Recasting Unauthorized Practice of Law in the Era of Generative AI,&#8221; 26 <em>Yale Journal of Law &amp; Technology</em> 64 (2023), https://dx.doi.org/10.2139/ssrn.5152523; Mia Bonardi &amp; L. Karl Branting, &#8220;Certifying Legal AI Assistants for Unrepresented Litigants: A Global Survey of Access to Civil Justice, Unauthorized Practice of Law, and AI,&#8221; 26 <em>Columbia Science &amp; Technology Law Review</em> 1 (2025), https://doi.org/10.52214/stlr.v26i1.13336.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-93" href="#footnote-anchor-93" class="footnote-number" contenteditable="false" target="_self">93</a><div class="footnote-content"><p>State supreme courts regulate the practice of law in the U.S., though some courts have delegated this task to bar associations and receive input from state legislators. The exact process by which these regulations change varies by jurisdiction, so we refer to the recommended actor as state courts for simplicity. 
For more on how professional regulations are enacted and modified, see Lucy Ricca &amp; Thomas Clarke, &#8220;The Bar Re-imagined: Options for State Courts to Re-structure the Regulation of the Practice of Law,&#8221; <em>Stanford Law School Deborah L. Rhode Center on the Legal Profession</em> (September 2023), https://perma.cc/WP7L-GWMY.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-94" href="#footnote-anchor-94" class="footnote-number" contenteditable="false" target="_self">94</a><div class="footnote-content"><p>Engstrom, Ricca, &amp; Knowlton, supra note 9, at 11.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-95" href="#footnote-anchor-95" class="footnote-number" contenteditable="false" target="_self">95</a><div class="footnote-content"><p>David Autor, &#8220;Applying AI to Rebuild Middle Class Jobs&#8221; (National Bureau of Economic Research, Working Paper No. 32140, February 2024), https://perma.cc/VBA4-JSFL.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-96" href="#footnote-anchor-96" class="footnote-number" contenteditable="false" target="_self">96</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-97" href="#footnote-anchor-97" class="footnote-number" contenteditable="false" target="_self">97</a><div class="footnote-content"><p>Green &amp; Murphy, supra note 92.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-98" href="#footnote-anchor-98" class="footnote-number" contenteditable="false" target="_self">98</a><div class="footnote-content"><p>Avery, Abril, &amp; del Riego, supra note 92.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-99" href="#footnote-anchor-99" class="footnote-number" contenteditable="false" target="_self">99</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-100" href="#footnote-anchor-100" class="footnote-number" contenteditable="false" target="_self">100</a><div class="footnote-content"><p>Steward, supra note 50.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-101" href="#footnote-anchor-101" class="footnote-number" contenteditable="false" target="_self">101</a><div class="footnote-content"><p>Drew Simshaw, &#8220;Toward National Regulation of Legal Technology: A Path Forward for Access to Justice,&#8221; 92 <em>Fordham Law Review</em> 1 (2023).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-102" href="#footnote-anchor-102" class="footnote-number" contenteditable="false" target="_self">102</a><div class="footnote-content"><p>See, generally, R. Matthew Black, &#8220;Extra Law Prices: Why MRPC 5.4 Continues to Needlessly Burden Access to Civil Justice for Low- to Moderate-Income Clients,&#8221; 25 <em>Washington and Lee Journal of Civil Rights and Social Justice</em> 499 (2019); Robert Saavedra Teuton, &#8220;One Small Step and a Giant Leap: Comparing Washington, D.C.&#8217;s Rule 5.4 With Arizona&#8217;s Rule 5.4 Abolition,&#8221; 65 <em>Arizona Law Review</em> 223 (2023); Stephen P. Younger, &#8220;The Pitfalls and False Promises of Nonlawyer Ownership of Law Firms,&#8221; 132 <em>Yale Law Journal Forum</em> 80 (2022); Gillian K. Hadfield, &#8220;Higher Demand, Lower Supply?
A Comparative Assessment of the Legal Resource Landscape for Ordinary Americans,&#8221; 37 <em>Fordham Urban Law Journal</em> 129 (2010); Gillian K. Hadfield &amp; Deborah L. Rhode, &#8220;How to Regulate Legal Services to Promote Access, Innovation, and the Quality of Lawyering,&#8221; 67 <em>Hastings Law Journal</em> 1191 (2016); Jonathan T. Molot, &#8220;What&#8217;s Wrong With Law Firms? A Corporate Finance Solution to Law Firm Short-Termism,&#8221; 88 <em>Southern California Law Review</em> 1 (2014).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-103" href="#footnote-anchor-103" class="footnote-number" contenteditable="false" target="_self">103</a><div class="footnote-content"><p>Engstrom, Ricca, &amp; Knowlton, supra note 9.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-104" href="#footnote-anchor-104" class="footnote-number" contenteditable="false" target="_self">104</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-105" href="#footnote-anchor-105" class="footnote-number" contenteditable="false" target="_self">105</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-106" href="#footnote-anchor-106" class="footnote-number" contenteditable="false" target="_self">106</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-107" href="#footnote-anchor-107" class="footnote-number" contenteditable="false" target="_self">107</a><div class="footnote-content"><p>Hadfield, &#8220;More Markets, More Justice,&#8221; supra note 24.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-108" href="#footnote-anchor-108" class="footnote-number" contenteditable="false" target="_self">108</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-109" href="#footnote-anchor-109" class="footnote-number" contenteditable="false" target="_self">109</a><div class="footnote-content"><p>See Daniel Schwarcz, Sam Manning, Patrick Barry, David R. Cleveland, J.J. Prescott, &amp; Beverly Rich, &#8220;AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice,&#8221; <em>Journal of Law &amp; Empirical Analysis</em> (forthcoming 2026), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162111; Lauren Martin, Nick Whitehouse, Stephanie Yiu, Lizzie Catterson, &amp; Rivindu Perera, &#8220;Better Call GPT, Comparing Large Language Models Against Lawyers&#8221; (Jan. 24, 2024) (unpublished manuscript), https://doi.org/10.48550/arXiv.2401.16212; Jonathan H. 
Choi, Amy Monahan, &amp; Daniel Schwarcz, &#8220;Lawyering in the Age of Artificial Intelligence,&#8221; 109 <em>Minnesota Law Review</em> 147 (2024), https://dx.doi.org/10.2139/ssrn.4626276.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-110" href="#footnote-anchor-110" class="footnote-number" contenteditable="false" target="_self">110</a><div class="footnote-content"><p>Sayash Kapoor, Peter Henderson, &amp; Arvind Narayanan, &#8220;Promises and Pitfalls of Artificial Intelligence for Legal Applications,&#8221; <em>Journal of Cross-disciplinary Research in Computational Law</em> (2024), https://dx.doi.org/10.2139/ssrn.4695412.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-111" href="#footnote-anchor-111" class="footnote-number" contenteditable="false" target="_self">111</a><div class="footnote-content"><p>Chuck Dinerstein, &#8220;When AI Takes Over: The Hidden Cost of Technological Progress,&#8221; <em>American Council on Science and Health</em> (April 1, 2025), https://perma.cc/9C6X-JAZT.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-112" href="#footnote-anchor-112" class="footnote-number" contenteditable="false" target="_self">112</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-113" href="#footnote-anchor-113" class="footnote-number" contenteditable="false" target="_self">113</a><div class="footnote-content"><p>&#8220;History of the Reforms,&#8221; https://perma.cc/XF4F-ZVSV (last visited Jan. 26, 2026).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-114" href="#footnote-anchor-114" class="footnote-number" contenteditable="false" target="_self">114</a><div class="footnote-content"><p>&#8220;Major Legal Regulators Fall Short in Latest Performance Assessment,&#8221; https://perma.cc/4YUS-7RD5 (last visited Jan. 26, 2026).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-115" href="#footnote-anchor-115" class="footnote-number" contenteditable="false" target="_self">115</a><div class="footnote-content"><p>Hadfield, &#8220;More Markets, More Justice,&#8221; supra note 24.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-116" href="#footnote-anchor-116" class="footnote-number" contenteditable="false" target="_self">116</a><div class="footnote-content"><p>Sam Tobin, &#8220;British Legal Regulator Criticised Over Collapse of Law Firm Axiom Ince,&#8221; <em>Reuters</em> (Oct. 29, 2024).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-117" href="#footnote-anchor-117" class="footnote-number" contenteditable="false" target="_self">117</a><div class="footnote-content"><p>See John Hyde, &#8220;Legal Services Board Slams SRA Failings in Damning Report,&#8221; <em>Law Gazette</em> (March 31, 2025), https://perma.cc/3N9W-6K2G; Oscar Glyn, &#8220;SRA Faces Closer Supervision After &#8216;Failing to Protect Public&#8217;,&#8221; <em>Law.com</em> (Oct. 
17, 2025).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-118" href="#footnote-anchor-118" class="footnote-number" contenteditable="false" target="_self">118</a><div class="footnote-content"><p>Daniel Schwarcz, &#8220;Regulating Insurance Sales or Selling Insurance Regulation?: Against Regulatory Competition in Insurance,&#8221; 94 <em>Minnesota Law Review</em> 1707 (2010).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-119" href="#footnote-anchor-119" class="footnote-number" contenteditable="false" target="_self">119</a><div class="footnote-content"><p>United Nations Office on Drugs and Crime, &#8220;Adversarial vs. Inquisitorial Legal Systems,&#8221; <em>E4J University Module Series,</em> https://perma.cc/Z7XA-D46A (last visited Jan. 26, 2026).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-120" href="#footnote-anchor-120" class="footnote-number" contenteditable="false" target="_self">120</a><div class="footnote-content"><p>Federal Rule of Evidence 706.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-121" href="#footnote-anchor-121" class="footnote-number" contenteditable="false" target="_self">121</a><div class="footnote-content"><p>Bradford H. Charles, &#8220;Rule 706: An Underutilized Tool to Be Used When Partisan Experts Become &#8216;Hired Guns&#8217;,&#8221; 60 <em>Villanova Law Review</em> 941 (2016).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-122" href="#footnote-anchor-122" class="footnote-number" contenteditable="false" target="_self">122</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-123" href="#footnote-anchor-123" class="footnote-number" contenteditable="false" target="_self">123</a><div class="footnote-content"><p>Federal Rule of Civil Procedure 53.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-124" href="#footnote-anchor-124" class="footnote-number" contenteditable="false" target="_self">124</a><div class="footnote-content"><p>Shira A. Scheindlin &amp; Jonathan M. Redgrave, &#8220;Special Masters and E-Discovery: The Intersection of Two Recent Revisions to the Federal Rules of Civil Procedure,&#8221; 30 <em>Cardozo Law Review</em> 347 (2008).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-125" href="#footnote-anchor-125" class="footnote-number" contenteditable="false" target="_self">125</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-126" href="#footnote-anchor-126" class="footnote-number" contenteditable="false" target="_self">126</a><div class="footnote-content"><p>Soia Mentschikoff, &#8220;Commercial Arbitration,&#8221; 61 <em>Columbia Law Review</em> 846 (1961), https://doi.org/10.2307/1120097.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-127" href="#footnote-anchor-127" class="footnote-number" contenteditable="false" target="_self">127</a><div class="footnote-content"><p>Christopher R. 
Drahozal, &#8220;Arbitration Costs and Forum Accessibility: Empirical Evidence,&#8221; 41 <em>University of Michigan Journal of Law Reform</em> 813 (2008), https://doi.org/10.36646/mjlr.41.4.arbitration.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-128" href="#footnote-anchor-128" class="footnote-number" contenteditable="false" target="_self">128</a><div class="footnote-content"><p>Theodore Eisenberg, Geoffrey P. Miller, &amp; Emily Sherwin, &#8220;Arbitration&#8217;s Summer Soldiers: An Empirical Study of Arbitration Clauses in Consumer and Nonconsumer Contracts,&#8221; 41 <em>University of Michigan Journal of Law Reform</em> 871 (2008), https://dx.doi.org/10.2139/ssrn.1076968.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-129" href="#footnote-anchor-129" class="footnote-number" contenteditable="false" target="_self">129</a><div class="footnote-content"><p>Amberlee B. Conley, &#8220;You Can Have Your Day in Court&#8212;But Not Before Your Day in Mandatory, Nonbinding Arbitration: Balancing Practicalities of State Arbitration,&#8221; 104 <em>Iowa Law Review</em> 325 (2018).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-130" href="#footnote-anchor-130" class="footnote-number" contenteditable="false" target="_self">130</a><div class="footnote-content"><p>Christopher R. Drahozal &amp; Quentin R. Wittrock, &#8220;Is There a Flight From Arbitration?,&#8221; 37 <em>Hofstra Law Review</em> 71 (2008).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-131" href="#footnote-anchor-131" class="footnote-number" contenteditable="false" target="_self">131</a><div class="footnote-content"><p>Michael J. Broyde &amp; Yiyang Mei, &#8220;Don&#8217;t Kill the Baby! The Case for AI in Arbitration,&#8221; 21 <em>New York University Journal of Law &amp; Business</em> 1 (2024).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-132" href="#footnote-anchor-132" class="footnote-number" contenteditable="false" target="_self">132</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-133" href="#footnote-anchor-133" class="footnote-number" contenteditable="false" target="_self">133</a><div class="footnote-content"><p>David Horton, &#8220;Forced Robot Arbitration,&#8221; 109 <em>Cornell Law Review</em> 679 (2024).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-134" href="#footnote-anchor-134" class="footnote-number" contenteditable="false" target="_self">134</a><div class="footnote-content"><p>Robert Walters, &#8220;Robots Replacing Human Arbitrators: The Legal Dilemma,&#8221; 34 <em>Information &amp; Communications Technology Law</em> 129 (2025), https://doi.org/10.1080/13600834.2024.2408155.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-135" href="#footnote-anchor-135" class="footnote-number" contenteditable="false" target="_self">135</a><div class="footnote-content"><p>David S. Schwartz, &#8220;Mandatory Arbitration and Fairness,&#8221; 84 <em>Notre Dame Law Review</em> 1247 (2009); Jessica Silver-Greenberg &amp; Robert Gebeloff, &#8220;Arbitration Everywhere, Stacking the Deck of Justice,&#8221; <em>New York Times</em> (Oct.
31, 2015).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-136" href="#footnote-anchor-136" class="footnote-number" contenteditable="false" target="_self">136</a><div class="footnote-content"><p>Stephen J. Ware, &#8220;The Centrist Case for Enforcing Adhesive Arbitration Agreements,&#8221; 23 <em>Harvard Negotiation Law Review</em> 29 (2017).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-137" href="#footnote-anchor-137" class="footnote-number" contenteditable="false" target="_self">137</a><div class="footnote-content"><p>Peter B. Rutledge &amp; Christopher R. Drahozal, &#8220;Contract and Choice,&#8221; 2013 <em>Brigham Young University Law Review</em> 1 (2013).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-138" href="#footnote-anchor-138" class="footnote-number" contenteditable="false" target="_self">138</a><div class="footnote-content"><p>Amanda R. Witwer, Lynn Langton, Duren Banks, Dulani Woods, Michael J.D. Vermeer, &amp; Brian A. Jackson, &#8220;Online Dispute Resolution: Perspectives to Support Successful Implementation and Outcomes in Court Proceedings,&#8221; <em>RAND Corporation</em> (2021), https://perma.cc/TD7Q-HGM7.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-139" href="#footnote-anchor-139" class="footnote-number" contenteditable="false" target="_self">139</a><div class="footnote-content"><p>See Broyde &amp; Mei, supra note 131, at 168&#8211;172.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-140" href="#footnote-anchor-140" class="footnote-number" contenteditable="false" target="_self">140</a><div class="footnote-content"><p>See Richard Re &amp; Alicia Solow-Niederman, &#8220;Developing Artificially Intelligent Justice,&#8221; 22 <em>Stanford Technology Law Review</em> 242, 278&#8211;80 (2019).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-141" href="#footnote-anchor-141" class="footnote-number" contenteditable="false" target="_self">141</a><div class="footnote-content"><p>Maria L. Marcus, &#8220;Judicial Overload: The Reasons and the Remedies,&#8221; 28 <em>Buffalo Law Review</em> 111, 132&#8211;33 (1979).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-142" href="#footnote-anchor-142" class="footnote-number" contenteditable="false" target="_self">142</a><div class="footnote-content"><p>Bert I. Huang, &#8220;Lightened Scrutiny,&#8221; 124 <em>Harvard Law Review</em> 1109 (2011).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-143" href="#footnote-anchor-143" class="footnote-number" contenteditable="false" target="_self">143</a><div class="footnote-content"><p>Peter S. Menell &amp; Ryan Vacca, &#8220;Revisiting and Confronting the Federal Judiciary Capacity &#8216;Crisis&#8217;: Charting a Path for Federal Judiciary Reform,&#8221; 108 <em>California Law Review</em> 789 (2020) (quoting &#8220;National Court of Appeals Act, Hearings on S. 2762 and S. 3423 Before the Subcommittee on Improvements in Judicial Machinery of the Committee on the Judiciary,&#8221; 94th Cong. 
i, 26&#8211;37 (1976)).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-144" href="#footnote-anchor-144" class="footnote-number" contenteditable="false" target="_self">144</a><div class="footnote-content"><p>Menell &amp; Vacca, supra note 143.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-145" href="#footnote-anchor-145" class="footnote-number" contenteditable="false" target="_self">145</a><div class="footnote-content"><p>Arbel, supra note 80.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-146" href="#footnote-anchor-146" class="footnote-number" contenteditable="false" target="_self">146</a><div class="footnote-content"><p>Id.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-147" href="#footnote-anchor-147" class="footnote-number" contenteditable="false" target="_self">147</a><div class="footnote-content"><p>Shaoli Katana, &#8220;MSBA&#8217;s 2024 Legislative Wins,&#8221; <em>Minnesota State Bar Association,</em> https://perma.cc/AP6Q-FQ8V (last visited Jan. 26, 2026).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-148" href="#footnote-anchor-148" class="footnote-number" contenteditable="false" target="_self">148</a><div class="footnote-content"><p>S. 1821, 119th Cong. (2025).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-149" href="#footnote-anchor-149" class="footnote-number" contenteditable="false" target="_self">149</a><div class="footnote-content"><p>Deborah L. Rhode, &#8220;Access to Justice: Connecting Principles to Practice,&#8221; 17 <em>Georgetown Journal of Legal Ethics</em> 369 (2004).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-150" href="#footnote-anchor-150" class="footnote-number" contenteditable="false" target="_self">150</a><div class="footnote-content"><p>Christian Veith, Michael Bandlow, Michael Harnisch, Hariolf Wenzler, Markus Hartung, &amp; Dirk Hartung, &#8220;How Legal Tech Will Change the Business of Law,&#8221; <em>Boston Consulting Group</em> (2016), https://perma.cc/2EM6-B4YB.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-151" href="#footnote-anchor-151" class="footnote-number" contenteditable="false" target="_self">151</a><div class="footnote-content"><p>Andy Teichholz, &#8220;The Modern General Counsel: Legal Advisor and Strategic Business Partner,&#8221; <em>Corporate Counsel Business Journal</em> (2024), https://perma.cc/26ZK-KDBS.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-152" href="#footnote-anchor-152" class="footnote-number" contenteditable="false" target="_self">152</a><div class="footnote-content"><p>Thomson Reuters, &#8220;Future of Professionals Report 2024&#8221; (2024), https://perma.cc/HHA4-7DUV.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-153" href="#footnote-anchor-153" class="footnote-number" contenteditable="false" target="_self">153</a><div class="footnote-content"><p>Marjorie Richter, &#8220;How AI Is Transforming the Legal Profession,&#8221; <em>Thomson Reuters Blog</em>, https://perma.cc/Z9ZL-BGXA (last visited Jan. 
26, 2026).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-154" href="#footnote-anchor-154" class="footnote-number" contenteditable="false" target="_self">154</a><div class="footnote-content"><p>Veith et al., supra note 150.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-155" href="#footnote-anchor-155" class="footnote-number" contenteditable="false" target="_self">155</a><div class="footnote-content"><p>Richter, supra note 153.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-156" href="#footnote-anchor-156" class="footnote-number" contenteditable="false" target="_self">156</a><div class="footnote-content"><p>Thomson Reuters, supra note 152.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-157" href="#footnote-anchor-157" class="footnote-number" contenteditable="false" target="_self">157</a><div class="footnote-content"><p>Teichholz, supra note 151.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-158" href="#footnote-anchor-158" class="footnote-number" contenteditable="false" target="_self">158</a><div class="footnote-content"><p>DRI Center for Law and Public Policy Artificial Intelligence Working Group, &#8220;Artificial Intelligence in Legal Practice: Benefits, Considerations, and Best Practices&#8221; (2024), https://perma.cc/8X6E-BC25.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-159" href="#footnote-anchor-159" class="footnote-number" contenteditable="false" target="_self">159</a><div class="footnote-content"><p>Cynthia Hardy &amp; Steve Maguire, &#8220;Institutional Entrepreneurship and Change in Fields,&#8221; in <em>Handbook of Organizational Institutionalism</em> (Royston Greenwood, Christine Oliver, Thomas B. Lawrence, Renate E. Meyer, Cynthia Hardy, &amp; Steve Maguire, eds., 2d ed., 2017), https://doi.org/10.4135/9781446280669.n11.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Fact checking Moravec's paradox]]></title><description><![CDATA[This famous aphorism is neither true nor useful]]></description><link>https://www.normaltech.ai/p/fact-checking-moravecs-paradox</link><guid isPermaLink="false">https://www.normaltech.ai/p/fact-checking-moravecs-paradox</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Thu, 29 Jan 2026 22:35:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5c0502b3-b033-478e-b7f7-d224b2ff51e4_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have launched a <a href="https://www.youtube.com/@ArvindOnAI">YouTube channel</a> in which I analyze AI developments from a normal technology perspective. This essay is based on my most recent video in which I did a <a href="https://www.youtube.com/watch?v=-Ty4BXrASFU">deep dive into Moravec&#8217;s paradox</a>, the endlessly repeated aphorism that tasks that are hard for humans are easy for AI and vice versa.</p><p>Here&#8217;s what I found:</p><ul><li><p>Moravec&#8217;s paradox has never been empirically tested. (It&#8217;s often repeated as a fact by many AI researchers, including pioneers I know and respect, but that doesn&#8217;t mean I&#8217;ll take their claims at face value!)</p></li><li><p>It is really a statement about what the AI community finds worthwhile to work on.
It doesn&#8217;t have any predictive power about which problems are going to be easy or hard for AI.</p></li><li><p>It comes with an evolutionary explanation that I find highly dubious. (AI researchers have a history of making stuff up about human brains without any relevant background in neuroscience or evolutionary biology.)</p></li><li><p>Moravec&#8217;s-paradox-style thinking has led to both alarmism (about imminent superintelligent reasoning) and false comfort (in areas like robotics).</p></li><li><p>To adapt to AI advances, we don&#8217;t need to predict capability breakthroughs. Since diffusion of new capabilities takes a long time, that gives us plenty of time to react &#8212; time that we often squander, and then panic!</p></li></ul><p>Watch the full argument <a href="https://www.youtube.com/watch?v=-Ty4BXrASFU">here</a> or read it below.</p><div><hr></div><p>Every week brings new claims about AI advances. How do we know what&#8217;s coming next? Could AI predict crime? Write award-winning novels? Hack into critical infrastructure? Will we finally have robots in our home that will fold our clothes and load our dishwashers?</p><p>What will AI advances mean for your job? What will it mean for the social fabric? It&#8217;s hard to deal with all this uncertainty. If only we had a way to predict which new AI capabilities will be developed soon and which ones will remain hard for the foreseeable future.</p><p>Historically, AI researchers&#8217; predictions about progress in AI abilities have been <a href="https://www.nytimes.com/1958/07/13/archives/electronic-brain-teaches-itself.html">pretty bad</a>. We don&#8217;t really have principles that describe which kinds of tasks are easy for AI and which ones are hard.</p><p>Well, we have one &#8212; <a href="https://en.wikipedia.org/wiki/Moravec%27s_paradox">Moravec&#8217;s paradox</a>. It refers to the observation that it&#8217;s easy to train computers to do things that people find hard, like math and logic, and hard to train them to do things that we find easy, like seeing the world or walking.</p><p>It comes from the 1988 book <em>Mind Children</em> by Hans Moravec, who was &#8212; and is &#8212; a robotics researcher. He wrote: </p><blockquote><p>It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.</p></blockquote><p>In the early days of artificial intelligence, researchers focused on chess and other reasoning tasks, since these were thought to be some of the hardest and what made us uniquely human. But funnily enough, if you want to build a robot that can defeat human grandmasters, figuring out which moves to make is the easy part. Physically making the moves on the chessboard is the hard part. This is pretty well known today, so Moravec&#8217;s paradox seems to make a lot of intuitive sense.</p><p>If Moravec&#8217;s paradox is true, the implications would be amazing. If we want to know which AI capabilities might be built next, we just have to see how hard they are for humans. So scientific research will get automated before folding clothes, and so on.</p><p>But here&#8217;s the thing &#8212; Moravec&#8217;s paradox has never been fact checked. And that&#8217;s despite videos with hundreds of thousands of views, and TED talks all repeating it as a fact. 
When I dug into the evidence behind the so-called paradox, I found something surprising.</p><p>In this essay I&#8217;ll discuss why the theory and evidence behind the paradox are flaky. Then I&#8217;ll explain why simplistic predictions about what is easy or hard for AI have misled AI researchers and tech leaders, producing alarmism on the one hand and false comfort on the other. (Now there&#8217;s a paradox.) And finally I&#8217;ll answer the question: if we can&#8217;t rely on Moravec&#8217;s paradox, then how should we prepare for AI advances and their impacts?</p><h2>The evidence behind the paradox is flaky</h2><p>How would we test Moravec&#8217;s paradox? We could take a sample of tasks, determine how hard each one is for people and how hard it is for AI, and make a graph. If we saw something like this, the paradox would be confirmed.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!jE1w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d45ce8-30a7-47c0-818d-a94b4b466cc3_1920x1080.png" width="1456" height="819"><figcaption class="image-caption">A possible way to empirically test Moravec&#8217;s paradox</figcaption></figure></div>
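<p><em>As a concrete illustration of this method, here is a hypothetical sketch in Python. The tasks and the 0-10 difficulty ratings are made-up placeholders for exposition, not measured data; a real test would need a principled sample of tasks and actual difficulty measurements for each.</em></p><pre><code># Hypothetical sketch of the test described above: rate each task's
# difficulty for humans and for AI on a common scale, then check whether
# the two ratings are negatively correlated, as Moravec's paradox predicts.
# The tasks and ratings below are illustrative placeholders, not data.
from statistics import correlation  # Python 3.10+

tasks = [
    # (task, difficulty for humans, difficulty for AI), each rated 0-10
    ("play tic-tac-toe", 1, 1),
    ("recognize a face", 1, 3),
    ("fold laundry", 2, 9),
    ("choose grandmaster-level chess moves", 9, 1),
    ("predict stock prices", 10, 10),
]

human = [h for _, h, _ in tasks]
ai = [a for _, _, a in tasks]

# Moravec's paradox predicts a Pearson r close to -1.
print(f"Pearson r = {correlation(human, ai):+.2f}")</code></pre>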
<p>But here&#8217;s the problem: Which set of tasks should we consider for our analysis?
When AI researchers say Moravec&#8217;s paradox checks out, they are implicitly limiting their focus to problems that are considered interesting in the AI research community.</p><p>There are an endless number of tasks that are easy for both humans and AI, but they are not interesting. How bright is an image? How to play tic tac toe? Thousands of these tasks get solved by programmers on a daily basis and coded into AI systems, and when people do so, they don&#8217;t make a fuss about them.</p><p>There are also an endless number of tasks that are hard for humans and, as far as anyone knows, are also hard for AI. Identifying hit songs; predicting stock prices; cracking as-yet-undeciphered ancient scripts; even building a Dyson sphere. These are so hard that there is essentially no progress on these problems, although some of them attract junk research that tends to quickly get <a href="https://reproducible.cs.princeton.edu/predicting-hits.html">debunked</a>. So these problems also don&#8217;t tend to get talked about as much. </p><p>In fact, there are thousands of problems that computer scientists have proved to be &#8220;NP-complete&#8221;, which means we have strong mathematical reasons to think they will forever be hard for AI, so serious AI researchers don&#8217;t tend to work on them. They work on easier, approximate versions of the problems instead.</p><p>The other two quadrants of the chart are different. On the top left, tasks like playing soccer, that are easy for humans but currently hard for AI, are extremely interesting to AI researchers. That&#8217;s because we know it&#8217;s possible to teach AI these skills, but we haven&#8217;t yet managed to do so, which makes them great tests for AI progress.</p><p>On the bottom right, problems that are hard for humans but easy for AI, such as searching the web, are also interesting. These capabilities tend to greatly augment human productivity. Even though they are in some sense &#8220;easy&#8221;, the industry invests a lot in making web search and other tools work as effectively, quickly, and cheaply as possible. 
So research on these tasks is a big driver of AI progress.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!5Tvy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1746607a-20b6-4467-a5dc-d33523358bbd_1920x1080.png" width="1456" height="819"><figcaption class="image-caption">Moravec&#8217;s paradox may be a selection effect caused by ignoring tasks as uninteresting when they&#8217;re either too easy or too hard for both humans and AI.</figcaption></figure></div>
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Moravec&#8217;s paradox may be a selection effect caused by ignoring tasks as uninteresting when they&#8217;re either too easy or too hard for both humans and AI.</figcaption></figure></div><p></p><p>In short, when you&#8217;re thinking about the space of all possible tasks, if you basically ignore two quadrants of your 2x2 matrix because they are not interesting, then of course it will seem like what you&#8217;re left with shows a strong negative correlation between the two axes.</p><h2>A flawed evolutionary argument</h2><p>To be clear, the reason AI researchers are drawn to Moravec&#8217;s paradox isn&#8217;t because it is empirically backed. It&#8217;s because it comes with an intuitively appealing story. In his book, Moravec provided this explanation:</p><blockquote><p>Encoded in the large, highly evolved sensory and motor portions of the human brain is a billion years of experience about the nature of the world and how to survive in it. The deliberate process we call reasoning is, I believe, the thinnest veneer of human thought, effective only because it is supported by this much older and much more powerful, though usually unconscious, sensorimotor knowledge. We are all prodigious olympians in perceptual and motor areas, so good that we make the difficult look easy. Abstract thought, though, is a new trick, perhaps less than 100 thousand years old. We have not yet mastered it. It is not all that intrinsically difficult; it just seems so when we do it.</p></blockquote><p>And then Moravec praises reasoning systems like <a href="https://en.wikipedia.org/wiki/Stanford_Research_Institute_Problem_Solver">STRIPS</a> from the 1970s, and this is the kind of thing he has in mind when he says that reasoning is easy for AI. These are purely symbolic systems that solve problems like how to put blocks on top of each other in a certain way.</p><p>But AI researchers have learned a lot in the half century since the heyday of STRIPS and other such reasoning programs. They seem quite quaint today. What we&#8217;ve learned is that symbolic reasoning only works well in closed and extremely narrow domains like chess with a fully specified set of rules. 
<h2>A flawed evolutionary argument</h2><p>To be clear, the reason AI researchers are drawn to Moravec&#8217;s paradox isn&#8217;t that it is empirically backed. It&#8217;s that it comes with an intuitively appealing story. In his book, Moravec provided this explanation:</p><blockquote><p>Encoded in the large, highly evolved sensory and motor portions of the human brain is a billion years of experience about the nature of the world and how to survive in it. The deliberate process we call reasoning is, I believe, the thinnest veneer of human thought, effective only because it is supported by this much older and much more powerful, though usually unconscious, sensorimotor knowledge. We are all prodigious olympians in perceptual and motor areas, so good that we make the difficult look easy. Abstract thought, though, is a new trick, perhaps less than 100 thousand years old. We have not yet mastered it. It is not all that intrinsically difficult; it just seems so when we do it.</p></blockquote><p>And then Moravec praises reasoning systems like <a href="https://en.wikipedia.org/wiki/Stanford_Research_Institute_Problem_Solver">STRIPS</a> from the 1970s, and this is the kind of thing he has in mind when he says that reasoning is easy for AI. These are purely symbolic systems that solve problems like how to put blocks on top of each other in a certain way.</p><p>But AI researchers have learned a lot in the half century since the heyday of STRIPS and other such reasoning programs, which seem quite quaint today. What we&#8217;ve learned is that symbolic reasoning works well only in closed, extremely narrow domains like chess, with a fully specified set of rules. Applied to real-world problems, such systems are brittle and quickly go off the rails.</p><p>Over the decades there have been many other failures of reasoning systems, like IBM&#8217;s Watson, that excelled in narrow demonstrations but failed in real-world settings that departed from what they were trained for.</p><p>Today, it is widely recognized that reasoning in open-ended settings requires <a href="https://cs.nyu.edu/~davise/papers/CommonsenseFinal.pdf">common-sense knowledge</a>. But common sense is one of the so-called easy-for-humans-but-hard-for-AI areas according to acolytes of Moravec&#8217;s paradox. In other words, AI reasoning isn&#8217;t easy after all.</p><p>And sure enough, AI reasoning that can replace human expertise in open-ended domains like law or scientific research is still very much unsolved.</p><p>If I were to speculate about where Moravec&#8217;s evolutionary argument goes wrong, it would be this: Reasoning might be new from an evolutionary perspective, but it builds on the things that animal brains have learned to do over hundreds of millions of years. This much Moravec acknowledges. But maybe there isn&#8217;t a separate skill called &#8220;abstract&#8221; reasoning that can be learned without all of this infrastructure.</p><h2>How simplistic models have misled AI researchers and tech leaders</h2><p>Unfortunately, partly because of the misconception in the AI world that reasoning is a separate skill that&#8217;s easy for computers, there is a widespread belief that AI will soon be superhuman at things like scientific research, operating a company, or even running a government. In my experience, many researchers tend to generalize from AI success at chess and other closed domains to these kinds of open-ended domains.</p><p>So AI leaders promise that investing trillions into data centers will lead to a cure for cancer and various other downstream benefits, without stopping to think about how these inputs will translate into the desired outputs. It also leads them to support some extreme policies, such as investing in AI science at the expense of human scientists, and to warn policymakers to prepare for a white-collar bloodbath. It also leads to the fear that leaders such as politicians, CEOs, and military generals will soon have no choice but to delegate important decisions to AI because it will be superhuman at reasoning.</p><p>Maybe... or maybe it&#8217;s all a myth. There may not even be one general skill called reasoning.</p><p>Maybe the limits to reasoning are actually things like the lack of verifiers. That is, AI can&#8217;t get that good at <a href="https://arxiv.org/abs/2402.01656">legal reasoning</a>, unlike reasoning in chess, because there is no way for AI to write millions of legal arguments and get immediate and accurate feedback on which ones are good and which ones aren&#8217;t, analogous to the way AI gets good at chess.</p><p>Maybe it&#8217;s partly due to the limits of real-world knowledge. That is, AI can&#8217;t quickly become superhuman at medical reasoning because it is limited by the available medical knowledge in the world.</p><p>In other words, the same factors that pose limits to human reasoning also pose limits to AI reasoning. In this view, the answer has nothing to do with biology.</p><p>Much of my writing is based on the idea that superhuman AI reasoning is a myth. Of course, you don&#8217;t have to accept this.
But when you hear these predictions of imminent superintelligence, it&#8217;s helpful to understand the underlying mental model that many people in the AI field have. And it&#8217;s definitely useful to know that we have five or six decades of evidence that whether a skill such as reasoning is easy or hard for AI can depend a lot on whether you&#8217;re talking about a closed domain or an open domain.</p><p>Now let&#8217;s look at the other side of the coin. Some AI capabilities are predicted to be <em>hard</em> because of Moravec&#8217;s paradox. Most frequently, it&#8217;s robotics.</p><p>Just as people say we have to worry about the job losses and safety implications of breakthroughs in reasoning, they&#8217;ll say we don&#8217;t have to worry about the job losses and safety risks of breakthroughs in robotics. It&#8217;s a hard problem, so improvements won&#8217;t happen overnight.</p><p>Unfortunately, this is false comfort. I could be wrong about what I said about reasoning earlier, and there could be a breakthrough tomorrow. Similarly, these researchers could be wrong about robotics, and there could be a breakthrough tomorrow.</p><p>In fact, another &#8220;hard problem&#8221; for AI used to be invoked to explain Moravec&#8217;s paradox: computer vision. It doesn&#8217;t get invoked anymore, because there was in fact a breakthrough around 2012-2013, when AI performance at tasks like object recognition shot up dramatically due to <a href="https://spectrum.ieee.org/alexnet-source-code">deep learning</a>. The reason it took even that long is that deep learning required GPUs, and the idea of using GPUs for AI only took hold around that time.</p><p>The scientific ideas behind deep learning had largely been <a href="https://www.nature.com/articles/323533a0">established</a> in the 1980s, before Moravec published his book; he just didn&#8217;t know it yet.</p><h2>Conclusion: if not Moravec&#8217;s paradox, then what?</h2><p>There&#8217;s a reason why rules of thumb like Moravec&#8217;s paradox are so tempting. It&#8217;s because we assume that if there were an AI capability breakthrough, there would be rapid societal effects such as job losses, so we should be prepared in advance.</p><p>But this isn&#8217;t actually true. Even breakthrough technologies take a long time to be successfully commercialized and deployed. In fact, breakthrough technologies may take an <em>especially</em> long time to be deployed, because the supporting infrastructure just isn&#8217;t there. There&#8217;s a famous <a href="https://www.bbc.com/news/business-40673694">case study</a> of why it took 40 years for electric power to replace steam power in factories.</p><p>Take self-driving cars. Waymo started testing them on public roads more than <a href="https://waymo.com/blog/2021/02/expanding-our-testing-in-san-francisco">15 years ago</a>.</p><p>There was little reason to doubt that this technology would one day be viable, and that it would have good and bad effects. Policymakers should have been preparing and figuring out how to compensate workers who stood to lose out.</p><p>Instead, people are waking up to the reality only now that these cars and trucks are already on the road, and are pushing policies like <a href="https://www.businessinsider.com/josh-hawley-banning-self-driving-cars-2025-9">banning</a> them.
Keep in mind that a million people per year die in car accidents worldwide, and self-driving cars are already much safer than human drivers.</p><p>A recurring theme of my work on AI is the difficulty of prediction &#8212; both using AI itself for prediction and predicting the future of AI capabilities and impacts.</p><p>Instead of relying on prediction, we should get better at shaping and adapting to the tech that we actually know for sure is coming. It is psychologically hard to let go of our desire to know the future, but if we give up this false comfort we can build a much more resilient society.</p><h2>Other videos</h2><p>I have published three full-length videos so far, and regularly publish short videos. Subscribe <a href="https://www.youtube.com/@ArvindOnAI">here</a>.</p><ul><li><p><a href="https://www.youtube.com/watch?v=VDfyuB9p7sI">What happens if there&#8217;s an AI crash?</a></p></li><li><p><a href="https://www.youtube.com/watch?v=hWrVksoyVDw">Why a Manhattan project for AI makes no sense</a></p></li><li><p><a href="https://www.youtube.com/watch?v=-Ty4BXrASFU">Debunking Moravec&#8217;s paradox</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[A guide to understanding AI as normal technology]]></title><description><![CDATA[And a big change for this newsletter]]></description><link>https://www.normaltech.ai/p/a-guide-to-understanding-ai-as-normal</link><guid isPermaLink="false">https://www.normaltech.ai/p/a-guide-to-understanding-ai-as-normal</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Tue, 09 Sep 2025 13:00:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d6ef1281-5124-4069-a716-955720587e38_2872x1606.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When we published <a href="https://knightcolumbia.org/content/ai-as-normal-technology">AI as Normal Technology</a>, the impact caught us off guard. It quickly became the most influential thing either of us had ever done.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> We took this as a strong signal to spend more of our time thinking and writing about the medium-term future of AI and its impacts, offering grounded analysis of a topic that tends to attract speculation. This is a shift in focus away from writing about the present-day and near-term impacts of AI, which is what the <a href="https://www.normaltech.ai/p/starting-reading-the-ai-snake-oil">AI Snake Oil</a> project was about.</p><p>Reflecting this shift, we have renamed this newsletter.
We have already published two <a href="https://www.aisnakeoil.com/p/agi-is-not-a-milestone">follow-up</a> <a href="https://www.aisnakeoil.com/p/could-ai-slow-science">essays</a> to AI as Normal Technology and will publish more regularly as we expand our framework into a book, which we plan to complete in late 2026 for publication in 2027.</p><p>Today, we address common points of confusion about AI as Normal Technology, try to make the original essay more approachable, and compare it to <a href="https://ai-2027.com/">AI 2027</a>.</p><p><strong>Table of contents</strong></p><ol><li><p><a href="https://www.normaltech.ai/i/173147197/normal-doesnt-mean-mundane-or-predictable">Normal doesn&#8217;t mean mundane or predictable</a></p></li><li><p><a href="https://www.normaltech.ai/i/173147197/a-restatement-of-our-thesis">A restatement of our thesis</a></p></li><li><p><a href="https://www.normaltech.ai/i/173147197/if-disappointment-about-gpt-has-nudged-you-towards-ai-as-normal-technology-its-possible-you-dont-quite-understand-the-thesis">If disappointment about GPT-5 has nudged you towards AI as normal technology, it&#8217;s possible you don&#8217;t quite understand the thesis</a></p></li><li><p><a href="https://www.normaltech.ai/i/173147197/why-its-hard-to-find-a-middle-ground-between-ai-as-normal-technology-and-ai">Why it&#8217;s hard to find a &#8220;middle ground&#8221; between AI as Normal Technology and AI 2027</a></p></li><li><p><a href="https://www.normaltech.ai/i/173147197/it-is-hard-to-understand-one-worldview-when-youre-committed-to-another">It is hard to understand one worldview when you&#8217;re committed to another</a></p></li><li><p><a href="https://www.normaltech.ai/i/173147197/reaping-ais-benefits-will-require-hard-work-and-painful-choices">Reaping AI&#8217;s benefits will require hard work and painful choices</a></p></li><li><p><a href="https://www.normaltech.ai/i/173147197/the-surreal-debate-about-the-speed-of-diffusion">The surreal debate about the speed of diffusion</a></p></li><li><p><a href="https://www.normaltech.ai/i/173147197/why-ai-adoption-hits-different">Why AI adoption hits different</a></p></li><li><p><a href="https://www.normaltech.ai/i/173147197/concluding-thoughts">Concluding thoughts</a></p></li></ol><h2><strong>Normal doesn&#8217;t mean mundane or predictable</strong></h2><p>While the essay talks about what we mean by normal (more on that below), we could have been more explicit about what it <em>doesn&#8217;t</em> mean.</p><p>Our point is not &#8220;nothing to see here, move along&#8221;. Indeed, unpredictable societal effects have been a hallmark of powerful technologies ranging from automobiles to social media. This is because they are emergent effects of complex interactions between technology and people. They don&#8217;t tend to be predictable based on the logic of the technology alone. That&#8217;s why rejecting technological determinism is one of the core premises of the normal technology essay.</p><p>In the case of AI, specifically chatbots, we&#8217;re already seeing emergent societal effects. 
The prevalence of AI companions and some of the harmful effects of model sycophancy such as &#8220;AI psychosis&#8221; have taken most observers <a href="https://www.hyperdimensional.co/p/for-all-issues-so-triable">by surprise</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> On the other hand, many risks that were widely predicted to be imminent, such as AI being used to manipulate elections, have <a href="https://www.aisnakeoil.com/p/we-looked-at-78-election-deepfakes">not materialized</a>.</p><p>What the landscape of AI&#8217;s social impacts will look like in, say, 3-5 years &#8212; even based on the diffusion of current capabilities, not future capabilities &#8212; is anyone&#8217;s guess.</p><p>The development of technical capabilities is more predictable than AI&#8217;s social impacts. Daniel Kokotajlo, one of the authors of AI 2027, was previously famous in the AI safety community for his essay &#8220;<a href="https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like">What 2026 looks like</a>&#8221; back in 2021. His predictions about the tech itself proved eerily accurate, but his predictions about social impacts were, overall, not directionally correct, a point he was gracious enough to concede in a podcast discussion with one of us.</p><p>All this makes AI a <em>more</em> serious challenge for institutions and policymakers because they will have to react nimbly to unforeseeable impacts instead of relying on the false comfort of prediction or trying to prevent all harm. Broadly speaking, the policymaking approach that enables such adaptability is called resilience, which is what our essay advocated for. But while we emphasized resilience as the approach to deal with potentially catastrophic risks, we should have been clearer that resilience also has an important role in dealing with more diffuse risks.</p><p>Perhaps the reason some readers misunderstood our view of predictability is the word &#8220;normal&#8221;. Again, our goal is not to trivialize the task of individually and collectively adapting to AI. In an ideal world, a better title would have been simply &#8220;AI as Technology&#8221;, but we didn&#8217;t think it would effectively communicate that our goal is to provide an alternative to the exceptionalism that characterizes the superintelligence worldview that currently dominates the discourse.</p><h2><strong>A restatement of our thesis</strong></h2><p>If we were to extract and simplify the core of our thesis, it would be something like this:</p><p>There is a long causal chain between AI capability increases and societal impact. Benefits and risks are realized when AI is deployed, not when it is developed. This gives us (individuals, organizations, institutions, policymakers) many points of leverage for shaping those impacts. So we don&#8217;t have to fret as much about the speed of capability development; our efforts should focus more on the deployment stage, both from the perspective of realizing AI&#8217;s benefits and of responding to risks. All this is not just true of today&#8217;s AI, but even in the face of hypothetical developments such as self-improvement in AI capabilities.
Many of the limits to the power of AI systems are (and should be) external to those systems, so that they cannot be overcome simply by having AI go off and improve its own technical design.</p><p>Aspects of this framework may have to be revised eventually, but that lies beyond the horizon bounding what we can meaningfully anticipate or prepare for:</p><blockquote><p>The world we describe in Part II is one in which AI is far more advanced than it is today. We are not claiming that AI progress&#8212;or human progress&#8212;will stop at that point. What comes after it? We do not know. Consider this analogy: At the dawn of the first Industrial Revolution, it would have been useful to try to think about what an industrial world would look like and how to prepare for it, but it would have been futile to try to predict electricity or computers. Our exercise here is similar. Since we reject &#8220;fast takeoff&#8221; scenarios, we do not see it as necessary or useful to envision a world further ahead than we have attempted to. If and when the scenario we describe in Part II materializes, we will be able to better anticipate and prepare for whatever comes next.</p></blockquote><p>Anyway, to reiterate, the core of the thesis is the underlying causal framework for understanding the relationship between AI and society, not any of the specific impacts that it might or might not have. In our view, if you share this causal understanding, you subscribe to the normal technology thesis. We have found that this framework is indeed widely shared, albeit implicitly.</p><p>That makes the thesis almost tautological in many readers&#8217; minds. We are making what we see as &#8212; and what those readers should see as &#8212; a very weak claim! Not recognizing this causes readers to search for something much more specific that we may have meant by &#8220;normal&#8221;. But we didn&#8217;t. We aren&#8217;t classifying technologies as &#8220;normal&#8221; and &#8220;abnormal&#8221; and then putting AI into the &#8220;normal&#8221; bucket. We&#8217;re just saying we should treat AI like we treat other powerful general-purpose technologies.</p><p>This is not specific to large language models or any particular kind of AI. Incidentally, that&#8217;s why the title is &#8220;AI as normal technology&#8221; and not &#8220;AI as <strong>a</strong> normal technology&#8221;. Our views apply to the whole umbrella of technologies that are collectively referred to as AI, and other similar technologies even if they are not referred to as AI.</p><p>If our worldview is almost tautological, why bother to state it? Because it is in contrast to the superintelligence worldview. That&#8217;s the thing about worldviews: there can be mutually contradictory worldviews that each seem tautological to those who subscribe to them.</p><h2><strong>If disappointment about GPT-5 has nudged you towards AI as normal technology, it&#8217;s possible you don&#8217;t quite understand the thesis</strong></h2><p>It&#8217;s notable that there&#8217;s been a surge of interest in our essay after the release of GPT-5, and reasonable to surmise that at least some of that is because of people shifting their views a bit after being disappointed by the release.</p><p>This is strange! This isn&#8217;t the first time this has happened &#8212; we previously expressed skepticism of a big <a href="https://www.aisnakeoil.com/p/is-ai-progress-slowing-down">narrative shift</a> around scaling that happened based on almost no new information. 
If a single update to one product shifts people&#8217;s views on the trajectory of AI, how reliable is people&#8217;s evidence base to begin with?</p><p>The reason why the normal technology framework predicts slow timelines is <em>not</em> because capabilities will hit a wall but because impacts will be slow and gradual <em>even if</em> capabilities continue to advance rapidly. So we don&#8217;t think disappointment with a new release should make you more sympathetic to viewing AI as normal technology. By the same token, a new breakthrough announced tomorrow shouldn&#8217;t cause you to be more skeptical of our views.</p><p>The best way to understand GPT-5 is that it&#8217;s a particularly good example of AI developers&#8217; shift in emphasis from models to products, which we wrote about <a href="https://www.aisnakeoil.com/p/ai-companies-are-pivoting-from-creating">a year ago</a>. The automatic model switcher is a big deal for everyday users of ChatGPT. It turns out that <a href="https://x.com/sama/status/1954603417252532479">hardly anyone</a> was using &#8220;thinking&#8221; models nearly a year after they were first released, and GPT-5 has bumped up their use dramatically.</p><p>In <a href="https://x.com/sama/status/1889755723078443244">some</a> communications Altman was clear that the emphasis for GPT-5 was usability, not a leap in capabilities, although this message was unfortunately undercut by the constant hype, leading to disappointment.</p><p>This broader shift in the industry is actually highly consistent with companies themselves (reluctantly) coming around to acknowledging the possibility that their path to success is to do the hard work of building products and fostering adoption, rather than racing to build AGI or superintelligence and count on it to sweep away any diffusion barriers. Ironically, in this narrative, GPT-5 is an example of a success, not a failure.</p><p>In fact, model developers are starting to go beyond developing more useful products (the second stage of our technology development &amp; adoption framework) and working with deployers to ease early adoption pains (the third stage). For example, OpenAI&#8217;s <a href="https://newsletter.pragmaticengineer.com/p/forward-deployed-engineers">Forward Deployed Engineers</a> work with customers such as John Deere, and directly with farmers, on integrating and deploying capabilities such as providing personalized recommendations for pesticide application.</p><h2><strong>Why it&#8217;s hard to find a &#8220;middle ground&#8221; between AI as Normal Technology and AI 2027</strong></h2><p>Many people have tried to articulate middle ground positions between <a href="https://ai-2027.com/">AI 2027</a> and AI as Normal Technology, perhaps viewing these as two ends of a spectrum of views.</p><p>This is surprisingly hard to do. Both AI 2027 and AI as Normal Technology are coherent worldviews. They represent very different causal understandings of how technology will impact society. If you try to mix and match, there is a risk that you end up with an internally inconsistent hodgepodge. (By the way, this means that if we end up being wrong, it is more likely that we will be wrong wholesale than slightly wrong.)</p><p>Besides, only in the Silicon Valley bubble can AI as Normal Technology be considered a skeptical view! We compare AI to electricity in the second sentence of the essay, and we make clear throughout that we expect it to have profound impacts. 
Our expectations for AI&#8217;s impact on labor seem to be at the more aggressive end of the range of expectations from economists who work on this topic.</p><p>In short, if you are looking for a moderate position, we encourage you to read the essay in full. Don&#8217;t let the title fool you into thinking we are AI skeptics. Perhaps you will conclude that AI as Normal Technology is already the middle ground you are looking for.</p><p>We realize that it can be discomfiting that the two most widely discussed frameworks for thinking about the future of AI are so radically different. (Our essay itself offers much commentary on this state of affairs in Part 4, which is about policy.) We can offer a few comforting thoughts:</p><ul><li><p>We do have many areas of agreement with the AI 2027 authors. We are working on a joint statement outlining those areas. We are grateful to Nicholas Carlini for organizing this effort.</p></li><li><p>In our view, more important than agreement in beliefs are areas of <a href="https://x.com/random_walker/status/1913334506318446877">common ground in policy</a> <em>despite</em> differences in beliefs. Even relatively &#8220;easy&#8221; policy interventions that different sides can agree on will be a huge challenge in practice. If we can&#8217;t achieve these, there is little hope for the much more radical measures favored by those worried about imminent superintelligence.</p></li><li><p>There have been a few ongoing efforts to identify the cruxes of disagreement and agree on indicators that might help adjudicate between the two worldviews. We have participated in a few of these efforts and look forward to continuing to do so. We are grateful to the Golden Gate Institute for AI&#8217;s efforts on this front.</p></li><li><p>Speaking of developing indicators, we are in the process of expanding the vision for our project HAL, <a href="https://hal.cs.princeton.edu/">Holistic Agent Leaderboard</a>. Currently it tries to be a better benchmark orchestration system for AI agents, but the new plan is to develop it into an early warning system that helps the AI community identify when AI agents have crossed capability thresholds for transformative real-world impacts in various domains.<br>We see these capability thresholds as necessary but not always sufficient conditions for impact, and as and when they are reached, they will much more acutely stress our theses about non-technological barriers to both benefits and risks.</p></li><li><p>Note that HAL is not about prediction; it is about situational awareness of the <em>present</em>. This is a theme of our work. What is remarkable about the AI discourse in general, and us versus AI 2027 in particular, is the wide range of views not just about the future but about the things we <em>can</em> observe, such as the speed of diffusion (more on that below). Unless we as a community get much better at measurement of the present and <a href="https://x.com/sayashk/status/1964016339690909847">testing competing causal explanations of progress</a>, the level of energy directed at prediction will be misdirected, because we lack ways of resolving those predictions. For example, we&#8217;ve argued that we won&#8217;t necessarily know if &#8220;AGI&#8221; has been built <a href="https://www.aisnakeoil.com/p/agi-is-not-a-milestone">even post facto</a>.
To some extent these limitations are intrinsic because of the lack of conceptual precision of ideas like AGI, but at the same time it&#8217;s true that we can do a lot better at measurement.</p></li></ul><h2><strong>It is hard to understand one worldview when you&#8217;re committed to another</strong></h2><p>We wrote:</p><blockquote><p>AI as normal technology is a worldview that stands in contrast to the worldview of AI as impending superintelligence. Worldviews are constituted by their assumptions, vocabulary, interpretations of evidence, epistemic tools, predictions, and (possibly) values. These factors reinforce each other and form a tight bundle within each worldview.</p></blockquote><p>This makes communication across worldviews hard. For example, one question we often receive from the AI 2027 folks is what we think the world will look like in 2027. Well, pretty much like the world in 2025, we respond. They then push us to consider 2035 or 2045 or whatever year by which the world will be transformed, and they consider it a deficiency of our framework that we don&#8217;t provide concrete scenarios.</p><p>But this kind of scenario forecasting is only a meaningful activity within their worldview. We are concrete about the things we think we <em>can</em> be concrete about. At the same time, we emphasize the role of human, institutional, and political agency in making radically different futures possible &#8212; including AI 2027. Thus, AI as normal technology is as much a prescription as a prediction.</p><p>These communication difficulties are important to keep in mind when considering the <a href="https://blog.ai-futures.org/p/ai-as-profoundly-abnormal-technology">response</a> by Scott Alexander, one of the AI 2027 authors, to AI as Normal Technology. While we have no doubt that it is a good-faith effort at dialogue and we appreciate his putting in the time, unfortunately we feel that his response mostly talks past us. What he identifies as the cruxes of disagreement are quite different from what we consider the cruxes! For this reason, we won&#8217;t give a point-by-point response, since we would probably end up talking past him in turn.</p><p>But we would be happy to engage in moderated conversations, a format with which we&#8217;ve had good success, having used it 8-10 times over the past year. The synchronous nature makes it much easier to understand each other. And the fact that the private conversation will be edited before being made public makes it easier to ask stupid questions as each side searches for understanding of the other&#8217;s point of view.</p><p>Anyway, here are a couple of important ways in which Alexander&#8217;s response talks past us. Recursive Self-Improvement (RSI) is a crux of disagreement from Alexander&#8217;s point of view, and he is surprised that it is barely worth a mention for us. In fairness, we could have been much more explicit in our essay about what we think about RSI. In short, we don&#8217;t think RSI will lead to superintelligence because the external bottlenecks to building and deploying powerful AI systems cannot be overcome by improving their technical design. That is why we don&#8217;t discuss it much.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>Although it is not a crux for us, we do explain in the essay why we think the AI community is nowhere close to RSI.
More recently, we&#8217;ve been thinking about the fundamental research challenges that need to be solved, and there are a lot more of them than we&#8217;d realized. And it is worth keeping in mind that the AI community might be <a href="https://www.linkedin.com/posts/randomwalker_video-in-a-talk-at-the-neuro-symbolic-ai-activity-7367182131556528128-rfeZ?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAD85EvsBeNijbxJLxmNZcA4cF5Gc0JEwrc4">particularly bad</a> at finding new paradigms for progress compared to other scientific communities. Again, this is an area where we hope that our project <a href="https://hal.cs.princeton.edu/">HAL</a> can play a role in measuring progress.</p><p>Another topic where Alexander&#8217;s response talks past us is the speed of diffusion, which we comment on briefly below and will address in more detail in a future essay.</p><p>The best illustration of the difficulty of discourse across worldviews is Alexander&#8217;s discussion of our hypotheses about whether or not superhuman AI abilities are possible in tasks such as prediction or persuasion. After reading his response several times, it is hard for us to figure out where exactly we agree and disagree. We wrote:</p><blockquote><p>We think there are relatively few real-world cognitive tasks in which human limitations are so telling that AI is able to blow past human performance (as AI does in chess). ... Concretely, we propose two such areas: forecasting and persuasion. We predict that AI will not be able to meaningfully outperform trained humans (particularly teams of humans and especially if augmented with simple automated tools) at forecasting geopolitical events (say elections). We make the same prediction for the task of persuading people to act against their own self-interest.</p></blockquote><p>You can read his full <a href="https://blog.ai-futures.org/p/ai-as-profoundly-abnormal-technology">response</a> to this in Section 3B of his essay, but in short it focuses on human biological limits:</p><blockquote><p>Humans gained their abilities through thousands of years of evolution in the African savanna. There was no particular pressure in the savanna for &#8220;get exactly the highest Brier score possible in a forecasting contest&#8221;, and there is no particular reason to think humans achieved this. Indeed, if the evidence for human evolution for higher intelligence in the past 10,000 years in response to agriculture proves true, humans definitely didn&#8217;t reach the cosmic maximum on the African savannah. Why should we think this last, very short round of selection got it exactly right?</p></blockquote><p>But rejecting a biological conception of human abilities is a key point of departure for us, something we take pains to describe in detail in the section &#8220;Human Ability Is Not Constrained by Biology&#8221;. That&#8217;s the problem with discussion across worldviews: If you take a specific statement, ignore the premises and terminological clarifications leading up to it, and interpret it in <em>your</em> worldview, it will seem like your opponent is clueless. Does Alexander think we are suggesting that if a savanna-dweller time traveled to the present, they would be able to predict elections?</p><p>He does emphasize that human performance is not fixed, but somehow sees this as a refutation of our thesis (rather than central to it). Perhaps the confusion arose because of our hypothesis that human performance at forecasting is close to the &#8220;irreducible error&#8221;.
We don&#8217;t imply that the irreducible error of forecasting is a specific number that is fixed for all time. Of course it depends on the data that is available &#8212; better polling leads to better forecasting &#8212; and on training that helps take advantage of that data. And some of that training might even be the result of AI-enabled research on forecasting. We emphasize in our original essay that human intelligence is special not because of our biology, but because of our (contingent) mastery of our tools, including AI. Thus, advances in AI will often improve <em>human</em> intelligence (abilities), and have the potential to improve the performance of <em>both</em> sides of the human-AI comparison we propose.</p><p>The point of our hypothesis is a simple one: We don&#8217;t think forecasting is like chess, where loads of computation can give AI a decisive speed advantage. The computational structure of forecasting is relatively straightforward, even though performance can be vastly improved through training and data. Thus, relatively simple computational tools in the hands of suitably trained teams of expert forecasters can squeeze (almost) all the juice there is to squeeze.</p><p>We are glad that Alexander&#8217;s response credits us with &#8220;putting their money where their mouth is on the possibility of mutual cooperation&#8221;. The sentiment is mutual. We look forward to continuing that cooperation, which we see as more productive than Substack rebuttals and counter-rebuttals.</p><h2><strong>Reaping AI&#8217;s benefits will require hard work and painful choices</strong></h2><p>There are two broad sets of implications of our framework: one for the economy and labor, and the other for safety. Once we get past a few basic premises (notably, that superintelligence is incoherent or impossible depending on how it is defined), our arguments behind these two sets of implications are largely different.</p><p>On economic impacts, our case is broadly that diffusion barriers won&#8217;t be overcome through capability improvement. As for safety, our case is primarily that achieving AI control without alignment is not only possible, it doesn&#8217;t even seem particularly hard, and doesn&#8217;t require scientific breakthroughs.</p><p>Since these two sets of arguments don&#8217;t overlap much, it is coherent to accept one set while rejecting (or being ambivalent about) the other. Indeed, our view of the economic impacts seems to have resonated particularly strongly with readers. Since the essay&#8217;s publication, we have had many discussions with people responsible for AI strategy in various industries. We discovered that the way they had been thinking about AI was consistent with ours, but they were starting to second-guess their approach because of all the hype. Our essay provided a coherent framework that backed up their intuitions as well as their observations in the trenches.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>While people deploying AI have a keen understanding of the difference between technology development and diffusion, our framework further divides each of those into two steps. On the development side, we emphasize the <a href="https://www.normaltech.ai/p/ai-companies-are-pivoting-from-creating">gap between models and products</a>, or capabilities and applications.
On the diffusion side, we differentiate between user learning curves and other aspects of adaptation by <em>individuals</em>, and the structural, organizational, or legal changes that might be necessary, which often require <em>collective</em> action. We illustrate the kinds of speed limits that operate at each of the four stages.</p><p>While user behaviors tend to change slowly but at least predictably, solving coordination problems or reforming sclerotic institutions &#8212; which are also prerequisites for effective technology adoption &#8212; is much more uncertain. As an example, consider how Air Traffic Control is stuck with technology from the middle of the 20th century even as the enormous costs of failing to modernize have become apparent.</p><p>While our essay pointed out that analogous diffusion barriers exist in the case of AI, we are only now doing the work of spelling out those barriers and identifying specific reforms that might be necessary. We will be writing more on this front, some of it in collaboration with Justin Curl.</p><p>It is worth bearing in mind that advanced AI is entering a world that is already both highly technological and highly regulated. We repeatedly find that the parts of workflows that AI tackles are unlikely to be bottlenecks, because many of the available productivity gains have already been unlocked through earlier waves of technologies. Meanwhile the actual bottlenecks prove resistant due to regulation or other external constraints. In many specific domains including <a href="https://www.aeaweb.org/articles?id=10.1257/jel.20201330">legal services</a> and <a href="https://www.normaltech.ai/p/could-ai-slow-science">scientific research</a>, competitive dynamics are so strong that productivity gains from AI lead to escalating arms races that don&#8217;t ultimately translate to societal value.</p><h2><strong>The surreal debate about the speed of diffusion</strong></h2><p>We&#8217;ve mentioned a few times that different camps disagree on how they characterize <em>current</em> AI impacts. Nowhere is this more apparent than on the speed of diffusion. AI boosters believe that AI is being adopted at an unprecedented pace. We completely disagree. Worse, as more evidence comes out, each side seems to be getting more certain of its interpretation.</p><p>We are working on an in-depth analysis of the speed of diffusion. For now, we point out a few basic fallacies in the common arguments and stats that get trotted out to justify the &#8220;rapid adoption&#8221; interpretation.</p><p>First, deployment is not diffusion. Often, when people talk about rapid adoption they just mean that when capabilities are developed, they can be near-instantly deployed to products (such as chatbots) that are used by hundreds of millions of users.</p><p>But this is not what diffusion means. It is not enough to know how many people have access to capabilities: What matters is how many people are actually using them, how long they are using them, and what they are using them for. When we drill down into those details, the picture looks very different.</p><p>For example, almost a year after the vaunted release of &#8220;thinking&#8221; models in ChatGPT, <a href="https://x.com/sama/status/1954603417252532479">less than 1%</a> of users used them on any given day! We take no pleasure in pointing this out even though it supports our thesis.
As enthusiastic early adopters of AI, we find this number so low that it is hard to grasp intuitively, and frankly pretty depressing.</p><p>Another example of a misleading statistic relates to the fraction of workers in certain high-risk domains who use AI. Such statistics tend to be offered in service of the claim that AI is being rapidly adopted in risky ways. But even in high-risk domains most tasks are actually <a href="https://dho.stanford.edu/wp-content/uploads/Spectrum_AI.pdf">mundane</a>, and when we dig in, the specific uses don&#8217;t seem that risky at all.</p><p>For example, a survey by the American Medical Association reported that a majority of <a href="https://www.ama-assn.org/system/files/physician-ai-sentiment-report.pdf">doctors</a> are using AI. But this includes things like transcription of dictated notes.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> It also includes things like asking a chatbot for a second opinion on a diagnosis (about 12% reported this use case in 2024, a whopping 1 percentage point increase from 11% in 2023). This is definitely a more serious use than transcription, but it is still well-founded. As we&#8217;ve pointed out before, even unreliable AI is very helpful for <a href="https://journalcrcl.org/crcl/article/view/62/28">error</a> <a href="https://www.linkedin.com/posts/randomwalker_error-checking-is-a-great-application-of-activity-7341077466800676866-eFft/">detection</a>.</p><p>Increasing adoption of AI for these tasks does <em>not</em> mean that doctors are about to start YOLOing it and abdicating their responsibility to their patients by delegating their decisions to ChatGPT. The vast majority of doctors understand the difference between these two types of uses, and there are many overlapping guardrails preventing widespread irresponsible use in the medical profession, including malpractice liability, professional codes, and regulation of medical devices.</p><p>The most misleading &#8220;rapid adoption&#8221; meme of all might be this widely shared chart, showing that ChatGPT reached 100M users in about two months:</p><p><em>[Chart: time taken by ChatGPT, Instagram, Facebook, Twitter, Spotify, and Netflix to reach 100M users]</em></p><p>It compares ChatGPT user growth with (1) Instagram, Facebook, and Twitter, which are social media apps whose usefulness depends on network effects, and which therefore characteristically grow much slower than apps that are useful from day one; (2) Spotify, an app that was initially invite-only; and (3) Netflix, a service that launched
with a limited inventory and required a subscription.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>What is reflected in this chart are early adopters who will check out an app if there&#8217;s buzz around it, and there was deafening buzz around ChatGPT. Once you exhaust these curious early users, the growth curve looks very different. In fact, a year later, ChatGPT had apparently only grown <a href="https://www.theverge.com/2024/8/29/24231685/openai-chatgpt-200-million-weekly-users">from 100M to 200M</a> users, which means the curve bent sharply to the right. That is conveniently not captured in this graph, which reflects only the first two months.</p><p>This chart would be useful if it gave us any evidence that the usual barriers to diffusion have been weakened or eliminated. It doesn&#8217;t. Two months is not enough time for the hard parts of diffusion to even get started, such as users adapting their workflows to productively incorporate AI. As such, this chart is irrelevant to any meaningful discussion of the speed of diffusion.</p><p>There are many other problems with this chart, but we&#8217;ll stop here.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> Again, this is far from a complete analysis of the speed of AI diffusion &#8212; that&#8217;s coming. For now, we&#8217;re just making the point that the majority of the commentary on this topic is simply unserious. And if this is what the discourse is like on a question for which we do have data, it is no surprise that <em>predictions</em> of the future from different camps bear no resemblance to each other.</p><h2><strong>Why AI adoption hits different</strong></h2><p>If the &#8220;rapid diffusion&#8221; meme is so wrong, why is it so pervasive and persistent? Because AI adoption <em>feels</em> like a tsunami in a way that the PC or the internet or social media never did. When people are intuitively convinced of something, they will be much less skeptical of data or charts that purport to confirm that feeling.</p><p>We recognize the feeling, of course. Our own lived experience of AI is different from past waves of technology. Initially, we dismissed this as a cognitive bias: whatever change we&#8217;re living through at the moment will feel like a much bigger shift than something we successfully adapted to in the past.</p><p>We now realize that we were wrong. The cognitive bias might be a small part of the explanation, but there is a genuine reason why AI adoption feels much more rapid and scary. In short, while it&#8217;s true that deployment is not diffusion, in the past, gradual deployment meant that users were somewhat buffered from having to constantly make decisions about adoption; now that buffer has been swept away. Let&#8217;s explain with a comparison to internet adoption.</p><p>Those of us who adopted dial-up internet in the 90s will remember a story that went something like this. When we first heard about the tech, we were put off by the high price of a PC. Gradually those prices came down. Meanwhile we got some experience using the internet at work or at a friend&#8217;s house. So when we bought a PC and dial-up internet a few years later, we already had some training. At first, dial-up was slow and expensive and there weren&#8217;t even that many websites, so we didn&#8217;t use the internet that much.
Gradually, prices came down, bandwidth improved, and more content came online, and we learned how to use the internet productively and safely in tandem with our increasing use.</p><p>Adopting general-purpose AI tools in the 2020s is a radically different experience because deployment of new capabilities is instantaneous. People have to spend much more of their time evaluating whether to adopt AI for some particular use case, and are constantly being told that if they don&#8217;t adopt it they will be left behind.</p><p>All our earlier points stand &#8212; learning curves exist, human behavior takes a long time to change, and organizational change takes even longer. But not using AI is now somewhat of an active choice, and people no longer have the excuse of not thinking about it because they don&#8217;t yet have access.</p><p>In short, deployment is only one of many steps in diffusion, and removing that bottleneck probably made diffusion slightly faster. But it <em>feels</em> dramatically faster because as soon as one hears about a particular AI use case, one has to decide whether or not to adopt it, even if the vast majority of the time people ultimately decide not to, for reasons that might be rational or irrational.</p><h2><strong>Concluding thoughts</strong></h2><p>One thing on which we definitely agree with AI boosters is that AI is not going away, nor will it become a niche like crypto that most people can ignore. Now that the collective initial shock of generative AI has worn off, there&#8217;s a need for structured ways to think about how AI&#8217;s impacts might unfold, instead of (over)reacting to each new technical capability or emergent social effect.</p><p>The AI-as-normal-technology framework &#8212; which we continue to elaborate in this newsletter &#8212; is one such approach. It is worth being familiar with, at least as an articulation of a historically grounded default way to think about tech&#8217;s societal impact, against which more exceptionalist accounts can be compared. The framework offers some degree of actionable guidance for business leaders, workers, students, people concerned about AI safety or AI ethics, and policymakers, among others.
We hope you follow along and contribute to the discussion.</p><p><em>We are grateful to Steve Newman and Felix Chen for feedback on a draft.</em></p><h2><strong>Further reading/viewing</strong></h2><ul><li><p>Arvind <a href="https://www.youtube.com/live/CriEhgw6f0k?si=sHp4XIrJ0SY7vKSc&amp;t=19971">presented</a> the paper at the World Bank Development Conference, focusing on the economic and labor implications.</p></li><li><p><a href="https://www.youtube.com/@ArvindOnAI">Arvind&#8217;s new YouTube channel</a> discusses AI developments based on the normal technology perspective.</p></li><li><p>We were fortunate to receive press coverage for AI as Normal Technology:</p><ul><li><p>In the last month, the New York Times had three op-eds that discussed the essay in depth, by<a href="https://www.nytimes.com/2025/08/19/opinion/artificial-general-intelligence-superintelligence.html"> Eric Schmidt and Selina Xu</a>, by<a href="https://www.nytimes.com/2025/08/20/opinion/ai-technology-chatgpt.html"> David Wallace-Wells</a>, and by<a href="https://www.nytimes.com/2025/08/24/opinion/chat-gpt5-open-ai-future.html"> Ezra Klein</a>.</p></li><li><p>Last week, the Economist discussed the normal technology thesis in the article "<a href="https://www.economist.com/finance-and-economics/2025/09/04/what-if-artificial-intelligence-is-just-a-normal-technology">What if artificial intelligence is just a &#8220;normal&#8221; technology?</a>"</p></li><li><p>In The New Yorker,<a href="https://www.newyorker.com/culture/open-questions/two-paths-for-ai"> Joshua Rothman</a> contrasted AI 2027 with AI as Normal Technology.</p></li><li><p>In Prospect Magazine, <a href="https://www.prospectmagazine.co.uk/ideas/technology/70046/ai-isnt-magic.-its-just-weird">Ethan Zuckerman</a> discussed the utility of viewing AI as Normal Technology.</p></li><li><p>James O'Donnell, for the MIT Technology Review, <a href="https://www.technologyreview.com/2025/04/29/1115928/is-ai-normal/">summarized</a> the AI as Normal Technology thesis.</p></li></ul></li><li><p>Some of our podcast appearances include: Arvind on New York Times' <a href="https://podcasts.apple.com/us/podcast/meta-on-trial-is-a-i-a-normal-technology-hatgpt/id1528594034?i=1000703985596">Hard Fork</a>, and <a href="https://www.oreilly.com/radar/is-ai-a-normal-technology/">Tim O&#8217;Reilly&#8217;s podcast</a>; Sayash on Lawfare's <a href="https://www.youtube.com/watch?v=mpmi-l-G-wI">Scaling Laws</a> and Carnegie's <a href="https://carnegieendowment.org/podcasts/interpreting-india/beyond-superintelligence-a-realists-guide-to-ai?utm_source=chatgpt.com">Interpreting India</a>.</p></li><li><p>We have had conversations with many people who have the superintelligence worldview, including many of the authors of AI 2027: Sayash debated <a href="https://www.youtube.com/watch?v=rVFAJQryzk8">Daniel Kokotajlo</a> and <a href="https://x.com/i/broadcasts/1mrGmPeorrNKy">Eli Lifland</a>; Arvind debated <a href="https://www.youtube.com/watch?v=2hby15Z3uXA">Daniel Kokotajlo</a>, and in<a href="https://asteriskmag.com/issues/10/does-ai-progress-have-a-speed-limit?utm_source=chatgpt.com"> Asterisk Mag</a>, Arvind and Ajeya Cotra debated whether AI progress has a speed limit, and how we&#8217;d know.</p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>As is so often the case, fortuitous timing played a big role in the success of the essay. 
It was released two weeks after AI 2027, but this was purely by coincidence &#8212; our publication date was actually based on the Knight Institute symposium on <a href="https://knightcolumbia.org/events/artificial-intelligence-and-democratic-freedoms">AI and Democratic Freedoms</a>. We are grateful to the Institute for the opportunity to publish the essay.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Mental health issues with chatbots in general, and problems such as addiction, have been widely recognized and discussed, including in our book <a href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference/dp/069124913X">AI Snake Oil</a>. These tend to be based on an analogy with social media. But it's one thing to anticipate the potential for mental health impacts, and another to predict specifically what impacts might emerge and how to avoid them.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>That said, we acknowledge that our whole thesis might be wrong, and it is more likely that we&#8217;re wrong if RSI is achieved.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The framework is adapted from the <a href="https://www.simonandschuster.com/books/Diffusion-of-Innovations-5th-Edition/Everett-M-Rogers/9780743258234">classic diffusion-of-innovations theory</a> and also influenced by recent writers such as Jeffrey Ding who have <a href="https://press.princeton.edu/books/paperback/9780691260341/technology-and-the-rise-of-great-powers">analyzed</a> geopolitical competition in AI through the lens of the innovation-diffusion gap.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>While there are risks even here, we think it is definitely an application that doctors should be exploring.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>In fact, there have been other apps whose initial growth was as fast or faster than ChatGPT, such as Pokemon Go and Threads (which bootstrapped off of Instagram and thus wasn&#8217;t reliant on network effects). But again, our bigger point is that this type of comparison is not informative. 
Threads ended up being something of a dud despite that initial growth.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Frankly, the far more impressive stat in this graph, in our view, is Instagram getting to a million users in only 2.5 months despite the need for network effects &#8212; considering that it was back in 2010, when phone internet speeds were much lower, the app was iPhone-only (!), and it mainly spread among 18-34-year-olds in the United States in the early days.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Could AI slow science?]]></title><description><![CDATA[Confronting the production-progress paradox]]></description><link>https://www.normaltech.ai/p/could-ai-slow-science</link><guid isPermaLink="false">https://www.normaltech.ai/p/could-ai-slow-science</guid><dc:creator><![CDATA[Sayash Kapoor]]></dc:creator><pubDate>Wed, 16 Jul 2025 21:35:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/177c9c7a-89ac-40e2-9493-34667f829d92_4086x2296.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI leaders have predicted that AI will enable <a href="https://www.darioamodei.com/essay/machines-of-loving-grace">dramatic</a> scientific <a href="https://blog.samaltman.com/the-gentle-singularity">progress</a>: curing cancer, doubling the human lifespan, colonizing space, and achieving a century of progress in the next decade. Given the cuts to federal funding for science in the U.S., the timing seems perfect, as AI could replace the need for a large scientific workforce.</p><p>It&#8217;s a common-sense view, at least among technologists, that AI will greatly speed up science as it gets adopted in every part of the scientific pipeline &#8212; summarizing existing literature, generating new ideas, performing data analyses and experiments to test them, writing up findings, and performing &#8220;peer&#8221; review.</p><p>But many early common-sense predictions about the impact of a new technology on an existing institution proved badly wrong. The Catholic Church welcomed the printing press as a way of solidifying its authority by printing Bibles. The early days of social media led to wide-eyed optimism about the spread of democracy worldwide following the Arab Spring.</p><p>Similarly, the impact of AI on science could be counterintuitive. Even if individual scientists benefit from adopting AI, it doesn&#8217;t mean science as a whole will benefit. When thinking about the macro effects, we are dealing with a complex system with emergent properties. That system behaves in surprising ways because it is not a market. It is better than markets at some things, like rewarding truth, but worse at others, such as reacting to technological shocks. So far, on balance, AI has been <a href="https://www.nature.com/articles/d41586-025-02241-2">an</a> <a href="https://reproducible.cs.princeton.edu/">unhealthy</a> <a href="https://www.understandingai.org/p/i-got-fooled-by-ai-for-science-hypeheres">shock</a> to science, stretching many of its processes to the breaking point.</p><p>Any serious attempt to forecast the impact of AI on science must confront the <strong>production-progress paradox</strong>. The rate of publication of scientific papers has been growing exponentially, increasing 500-fold between 1900 and 2015.
But actual progress, by any available measure, has been constant or even <em>slowing</em>. So we must ask how AI is impacting, and will impact, the factors that have led to this disconnect.</p><p>Our analysis in this essay suggests that AI is likely to worsen the gap. This may not be true in all scientific fields, and it is certainly not a foregone conclusion. By carefully and urgently taking actions such as those we suggest below, it may be possible to reverse course. Unfortunately, AI companies, science funders, and policy makers all seem oblivious to what the actual bottlenecks to scientific progress are. They are simply trying to accelerate production, which is like adding lanes to a highway when the slowdown is actually caused by a toll booth. It&#8217;s sure to make things worse.</p><h4><strong>Table of contents</strong></h4><p><a href="https://www.aisnakeoil.com/i/168505690/science-has-been-slowing-the-production-progress-paradox">1. Science has been slowing &#8212; the production-progress paradox</a></p><p><a href="https://www.aisnakeoil.com/i/168505690/why-is-progress-slowing-can-ai-help">2. Why is progress slowing? Can AI help?</a></p><p><a href="https://www.aisnakeoil.com/i/168505690/science-is-not-ready-for-software-let-alone-ai">3. Science is not ready for software, let alone AI</a></p><p><a href="https://www.aisnakeoil.com/i/168505690/ai-might-prolong-the-reliance-on-flawed-theories">4. AI might prolong the reliance on flawed theories</a></p><p><a href="https://www.aisnakeoil.com/i/168505690/human-understanding-remains-essential">5. Human understanding remains essential</a></p><p><a href="https://www.aisnakeoil.com/i/168505690/implications-for-the-future-of-science">6. Implications for the future of science</a></p><p><a href="https://www.aisnakeoil.com/i/168505690/final-thoughts">7. Final thoughts</a></p><h3><strong>Science has been slowing &#8212; the production-progress paradox</strong></h3><p>The total number of published papers is increasing <em>exponentially</em>, <a href="https://dl.acm.org/doi/pdf/10.1145/3097983.3098016">doubling</a> every 12 years. The total number of researchers who have authored a research paper is increasing even more quickly. And between 2000 and 2021, investment in research and development&nbsp;<a href="https://ncses.nsf.gov/pubs/nsb20246/cross-national-comparisons-of-r-d-performance">increased</a>&nbsp;fourfold across the top seven funders (the US, China, Japan, Germany, South Korea, the UK, and France).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>But does this mean faster progress? Not necessarily.
Some papers lead to fundamental breakthroughs that change the trajectory of science, while others make minor improvements to known results.</p><p>Genuine progress results from breakthroughs in our understanding. For example, we understood plate tectonics in the middle of the last century &#8212; the idea that the continents move. Before that, geologists weren&#8217;t even able to ask the right <em>questions</em>. They tried to figure out the effects of the cooling of the Earth, believing that that&#8217;s what led to geological features such as mountains. No amount of findings or papers in older paradigms of geology would have led to the same progress that plate tectonics did.</p><p>So it is possible that the number of papers is increasing exponentially while progress is not increasing at the same rate, or is even slowing down. How can we tell if this is the case?</p><p>One challenge in answering this question is that, unlike the production of research, progress does not have clear, objective metrics. Fortunately, an entire research field &#8212; the "<a href="https://www.science.org/doi/full/10.1126/science.aao0185">science of science</a>", or metascience &#8212; is trying to answer this question. Metascience uses the scientific method to study scientific research. It tackles questions like: How often can studies be replicated? What influences the quality of a researcher's work? How do incentives in academia affect scientific outcomes? How do different funding models for science affect progress? And how quickly is progress really happening?</p><p><em>[Figure] Left: The number of papers authored and authors of research papers have been increasing exponentially (from <a href="https://dl.acm.org/doi/pdf/10.1145/3097983.3098016">Dong et al.</a>, redrawn to linear scale using a <a href="https://automeris.io/wpd/">web plot digitizer</a>). Right: The disruptiveness of papers is declining over time (from <a href="https://www.nature.com/articles/s41586-022-05543-x">Park et al.</a>).</em></p><p>Strikingly, many findings from metascience suggest that progress has been slowing down, despite dramatic increases in funding, the number of papers published, and the number of people who author scientific papers. We collect some evidence below; Matt Clancy <a href="https://mattsclancy.substack.com/p/science-is-getting-harder">reviews</a> many of these findings in much more depth.</p><p>1) Park et al. find that "disruptive" scientific work <a href="https://www.nature.com/articles/s41586-022-05543-x">represents</a> an ever-smaller fraction of total scientific output. Despite an exponential increase in the number of published papers and patents, the number of breakthroughs is roughly constant.</p><p>2) Research that introduces new ideas is more likely to coin new terms. <a href="https://www.oecd.org/en/publications/artificial-intelligence-in-science_a8d820bd-en.html">Milojevic</a> collects the number of unique phrases used in titles of scientific papers over time as a measure of the &#8220;cognitive extent&#8221; of science, and finds that while this metric increased until the early 2000s, it has since stagnated, with the number of unique title phrases even going <em>down</em>.</p><p>3) Patrick Collison and Michael Nielsen <a href="https://www.theatlantic.com/science/archive/2018/11/diminishing-returns-science/575665/">surveyed</a> researchers across fields on how they perceived progress in the most important breakthroughs in their fields over time &#8212; those that won a Nobel prize. They asked scientists to compare Nobel-prize-winning research from the 1910s to the 1980s.</p><p>They found that scientists considered advances from earlier decades to be roughly as important as the ones from more recent decades, across Medicine, Physics, and Chemistry. Despite the vast increases in funding, published papers, and authors, the most important breakthroughs today are about as impressive as those of decades past.</p><p>4) Matt Clancy <a href="https://mattsclancy.substack.com/p/science-is-getting-harder">complements</a> this with an analysis of what fraction of discoveries that won a Nobel Prize in a given year were published in the preceding 20 years.
He found that this number dropped from 90% in 1970 to 50% in 2015, suggesting that either transformative discoveries are happening at a slower pace, or that it takes longer for discoveries to be recognized as transformative.</p><p><em>[Figure] Share of papers describing each year&#8217;s Nobel-prize winning work that were published in the preceding 20 years. 10-year moving average. Source: <a href="https://mattsclancy.substack.com/p/science-is-getting-harder">Clancy</a> based on data from <a href="https://www.nature.com/articles/s41597-019-0033-6">Li et al.</a></em></p><p>5) <a href="https://pubs.aeaweb.org/doi/pdfplus/10.1257/aer.20180338">Bloom et al.</a> analyze research output from an economic perspective. Assuming that economic growth ultimately comes from new ideas, the constant or declining rate of growth implies that the exponential increase in the number of researchers is being offset by a corresponding decline in the output per researcher.
They find that this pattern holds true when drilling down into specific areas, including semiconductors, agriculture, and medicine (where the progress measures are Moore&#8217;s law, crop yield growth, and life expectancy, respectively).</p>
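<p>To make Bloom et al.&#8217;s accounting concrete, here is a stylized calculation; the numbers are illustrative, not theirs. If the growth that research delivers stays flat while the research workforce compounds, measured productivity per researcher must fall exponentially:</p><pre><code># Stylized version of Bloom et al.'s accounting identity:
#   research productivity = growth delivered / number of researchers.
# Illustrative numbers, not theirs.
growth_rate = 0.02        # suppose growth from new ideas stays flat at 2%/year
workforce_growth = 0.04   # suppose the research workforce grows 4%/year

for year in range(0, 51, 10):
    researchers = (1 + workforce_growth) ** year   # workforce relative to year 0
    productivity = growth_rate / researchers       # growth delivered per researcher
    print(f"year {year:2d}: researchers x{researchers:.2f}, productivity x{productivity / growth_rate:.2f}")
</code></pre><p>Under these assumptions, the workforce grows about sevenfold over 50 years while per-researcher productivity falls by the same factor. That is the sense in which flat growth plus exponentially more researchers implies declining research productivity.</p>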
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>The decline of research productivity. Note that economists use &#8220;production&#8221; as a catch-all term, with paper and patent counts, growth, and other metrics being different ways to measure it. We view production and progress as fundamentally different constructs, so we use the term production in a narrower sense. Keep in mind that in the figure, &#8220;productivity&#8221; isn&#8217;t based on paper production but on measures that are better viewed as progress measures. Source: <a href="https://pubs.aeaweb.org/doi/pdfplus/10.1257/aer.20180338">Bloom et al.</a></em></figcaption></figure></div><p>Of course, there are shortcomings in each of the metrics above. This is to be expected: since progress doesn't have an objective metric, we need to rely on proxies for measuring it, and these proxies will inevitably have some flaws.</p><p>For example, <a href="https://www.nature.com/articles/s41586-022-05543-x">Park et al.</a> used citation patterns to flag papers as "disruptive": if follow-on citations to a given paper don't also cite the studies this paper cited, the paper is more likely to be considered disruptive. One criticism of the paper is that this could simply be a result of how citation practices have evolved over time, not a result of whether a paper is truly disruptive. And the metric does flag some breakthroughs as non-disruptive &#8212; for example, AlphaFold is not considered a disruptive paper by this metric.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>But taken together, the findings do suggest that scientific progress is slowing down, at least compared to the volume of papers, researchers, and resources. Still, this is an area where further research would be fruitful &#8212; while the decline in the pace of progress relative to inputs seems very clear, it is less clear what is happening at an aggregate level. 
<p>But taken together, the findings do suggest that scientific progress is slowing down, at least relative to the volume of papers, researchers, and resources. Still, this is an area where further research would be fruitful &#8212; while the decline in the pace of progress relative to inputs seems very clear, it is less clear what is happening at an aggregate level. Furthermore, there are many notions of what the <a href="https://compass.onlinelibrary.wiley.com/doi/10.1111/phc3.12525">goals of science</a> are and what progress even means, and it is not clear how to connect the available progress measures to these higher-level definitions.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/162f2914-73f9-450d-9aec-4abdcb3b7fe7_1420x718.jpeg" alt="Summary of major lines of evidence of the slowdown in scientific progress"><figcaption class="image-caption"><em>Summary of a few major lines of evidence of the slowdown in scientific progress</em></figcaption></figure></div><h3><strong>Why is progress slowing? Can AI help?</strong></h3><p>There are many hypotheses for why progress could be slowing. One set of hypotheses holds that slowdown is an intrinsic feature of scientific progress and is what we should expect. For example, there&#8217;s the low-hanging fruit hypothesis &#8212; the easy scientific questions have already been answered, so what remains to be discovered is getting harder.</p><p>This is an intuitively appealing idea. But we don&#8217;t find it convincing. Adam Mastroianni gives many compelling <a href="https://www.experimental-history.com/p/ideas-arent-getting-harder-to-find">counter-arguments</a>. He points out that we&#8217;ve been wrong about this over and over, and he lists many comically mis-timed assessments of scientific fields reaching saturation just before they ended up undergoing revolutions, such as physics in the 1890s.</p><p>While it&#8217;s true that lower-hanging fruit gets picked first, there are countervailing factors. Over time, our scientific tools improve and we stand on the tower of past knowledge, making it easier to reach higher. Often, the benefits of improved tools and understanding are so transformative that whole new fields and subfields are created. New fields from the last 50-100 years include computer science, climate science, cognitive neuroscience, network science, genetics, molecular biology, and many others. Effectively, we&#8217;re plucking fruit from new trees, so there is always low-hanging fruit.</p><p>In our view, the low-hanging fruit hypothesis can at best partly explain slowdowns <em>within</em> fields. So it&#8217;s worth considering other ideas.</p><p>The second set of hypotheses is less fatalistic: there&#8217;s something suboptimal about the way we&#8217;ve structured the practice of science, and so the efficiency of converting scientific inputs into progress is dropping.
In particular, one subset of hypotheses flags the <em>increase in the rate of production</em> itself as the causal culprit &#8212; science is slowing down because it is trying to go too fast.</p><p>How could this be? The key is that any one scientist&#8217;s attention is finite, so they can only pay attention to a limited number of papers every year. That makes it too <a href="https://www.sciencedirect.com/science/article/pii/S0048733322000129">risky</a> for authors to depart from the canon: would-be breakthrough papers get lost in the noise and fail to reach a critical mass of scholars. The greater the rate of production, the greater the noise, the less attention truly novel papers receive, and the less likely they are to break through into the canon.</p><p><a href="https://www.pnas.org/doi/epub/10.1073/pnas.2021636118">Chu and Evans</a> explain:</p><blockquote><p>when the number of papers published each year grows very large, the rapid flow of new papers can force scholarly attention to already well-cited papers and limit attention for less-established papers&#8212;even those with novel, useful, and potentially transformative ideas. Rather than causing faster turnover of field paradigms, a deluge of new publications entrenches top-cited papers, precluding new work from rising into the most-cited, commonly known canon of the field.</p><p>These arguments, supported by our empirical analysis, suggest that the scientific enterprise&#8217;s focus on quantity may obstruct fundamental progress. This detrimental effect will intensify as the annual mass of publications in each field continues to grow</p></blockquote><p>Another causal mechanism relates to scientists&#8217; publish-or-perish <a href="https://www.nber.org/system/files/working_papers/w26752/w26752.pdf">incentives</a>. Production is easy to measure; progress is hard to measure. So universities and other scientific institutions judge researchers on measurable criteria such as how many papers they publish and how much grant funding they receive. It is not uncommon for scientists to have to publish a certain number of peer-reviewed papers to be hired or to get tenure (whether due to implicit norms or explicit requirements).</p><p>The emphasis on production metrics seems to be worsening over time. Physics Nobel winner Peter Higgs famously <a href="https://www.theguardian.com/science/2013/dec/06/peter-higgs-boson-academic-system">noted</a> that he wouldn't even have been able to get a job in modern academia because he wouldn't be considered productive enough.</p><p>So individual researchers' careers might be better off if they are <a href="https://journals.sagepub.com/doi/abs/10.1177/0003122415601618">risk averse</a>, but this risk aversion might reduce the collective rate of progress. Rzhetsky et al. find <a href="https://www.pnas.org/doi/10.1073/pnas.1509757112">evidence</a> of this phenomenon in biomedicine: experiments focus too heavily on molecules that are <em>already</em> considered important (which is more likely to lead to a publication) rather than on riskier experiments that could lead to genuine breakthroughs. Worryingly, they find this phenomenon worsening over time.</p><p>This completes the feedback loop: career incentives lead researchers to publish more papers, and they disincentivize the kind of novel research that results in true breakthroughs (but might only yield a single paper after years of work).
</p><p>If slower progress is indeed being caused by faster production, how will AI impact it? Most obviously, automating parts of the scientific process will make it even easier for scientists to chase meaningless productivity metrics. AI could make individual researchers more creative but decrease the <a href="https://www.science.org/doi/10.1126/sciadv.adn5290">creativity</a> of the collective because of a <a href="https://www.nature.com/articles/s41562-024-01959-9#Sec14">homogenizing effect</a>. AI could also exacerbate the inequality of attention and make it even harder for new ideas to break through. Existing search technology, such as&nbsp;<a href="https://jevinwest.org/papers/Kim2017asa.pdf">Google Scholar,</a>&nbsp;seems to be having exactly this effect.</p><p>To recap, so far we&#8217;ve argued that if the slowdown in science is caused by overproduction, AI will make it worse. In the next few sections, we&#8217;ll discuss why AI could worsen the slowdown regardless of what&#8217;s causing it.</p><h3><strong>Science is not ready for software, let alone AI</strong></h3><p>How do researchers use AI? In many ways: AI-based modeling to uncover trends in data using sophisticated pattern-matching algorithms; hand-written machine learning models specified based on expert knowledge; or even generative AI to write the code that researchers previously wrote. While some applications, such as using AI for literature review, don't involve writing code, most applications of AI for science are, in essence, <em><a href="https://russpoldrack.substack.com/p/why-better-code-can-lead-to-better">software development</a>.</em></p><p>Unfortunately, scientists are notoriously <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;arnumber=6886129">poor software engineers</a>. Practices that are bog-standard in the industry, like automated testing, version control, and following programming design guidelines, are largely absent or haphazardly adopted in the research community. These are practices that were developed and standardized over the last six decades of software engineering to prevent bugs and ensure the software works as expected.</p><p>Worse, there is little scrutiny of the software used in scientific studies. While peer review is a long and arduous step in publishing a scientific paper, it does not involve reviewing the code accompanying the paper, even though most of the "science" in computational research is being carried out in the code and data accompanying a paper, and only summarized in the paper itself.</p><p>In fact, papers often fail to even share the code and data used to generate results, so even if other researchers are willing to review the code, they don't have the means to. Gabelica et al. <a href="https://pubmed.ncbi.nlm.nih.gov/35654271/">found</a> that of 1,800 biomedical papers that pledged to share their data and code, 93% did not end up sharing these artifacts. This even affects results in the most prominent scientific journals: Stodden et al. <a href="https://www.pnas.org/doi/10.1073/pnas.1708290115">contacted</a> the authors of 204 papers published in Science, one of the top scientific journals, to get the code and data for their study. Only 44% responded.</p><p>When researchers do share the code and data they used, it is often disastrously wrong. 
Even simple tools, like Excel, have <a href="https://www.knkx.org/money-matters/2013-04-23/how-an-excel-error-derailed-the-federal-deficit-debate">notoriously</a> led to widespread errors in various fields. A 2016 study <a href="https://www.science.org/content/article/one-five-genetics-papers-contains-errors-thanks-microsoft-excel">found</a> that one in five genetics papers contains Excel-related errors, for example because gene names (say, Septin 2, abbreviated SEPT2) were automatically converted to dates (September 2). Similarly, it took decades for most scientific communities to <a href="https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf">learn</a> how to use simple statistics responsibly.</p><p>AI <a href="https://www.nature.com/articles/s41586-024-07146-0">opens</a> a whole new can of worms. The AI community often advertises AI as a silver bullet without acknowledging how difficult it is to detect subtle errors. Unfortunately, it takes much less competence to <em>use</em> AI tools than to understand them deeply and learn to identify their errors. As with other software-based research, errors in AI-based science can take a long time to uncover. If the widespread adoption of AI leads to researchers conducting or building on erroneous research, it could slow progress, since time and effort are wasted in unproductive research directions.</p><p>Indeed, we've found that AI has already led to widespread errors. Even before generative AI, traditional machine learning led to <a href="https://reproducible.cs.princeton.edu/">errors</a> in over 600 papers across 30 scientific fields. In many cases, the affected papers constituted the majority of the papers surveyed, raising the possibility that in many fields, the <em>majority</em> of AI-enabled research is flawed. Others have found that AI tools are often used with <a href="https://www.understandingai.org/p/i-got-fooled-by-ai-for-science-hypeheres">inappropriate</a> baseline comparisons, making it incorrectly seem like they outperform older methods. These errors are not just theoretical: they affect the potential real-world deployment of AI too. For example, <a href="https://www.nature.com/articles/s42256-021-00307-0">Roberts et al.</a> found that of 400+ papers using AI for COVID-19 diagnosis, <em>none</em> produced clinically useful tools due to methodological flaws.</p><p>Applications of generative AI can introduce new types of errors as well. For example, while AI can aid in programming, code generated using AI often has <a href="https://arxiv.org/abs/2211.03622">errors</a>. As AI adoption increases, we will discover more applications of AI for science, and we suspect we'll find widespread errors in many of them.</p>
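<p>Many of the 600-plus flawed papers mentioned above fail for the same underlying reason: data leakage, where information from the test set contaminates training. Here is a minimal sketch of one common variant, selecting features before splitting the data; this is our synthetic illustration, not code from any surveyed paper:</p><pre><code># Minimal sketch of a common leakage bug: doing feature selection on the full
# dataset (train + test) before splitting. Synthetic, signal-free data.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))   # pure noise: there is nothing to learn
y = rng.integers(0, 2, size=200)

# LEAKY: pick the 20 "most predictive" columns using all rows, test rows included.
leaky = SelectKBest(f_classif, k=20).fit(X, y).get_support()
Xtr, Xte, ytr, yte = train_test_split(X[:, leaky], y, random_state=0)
print("leaky accuracy:", LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte))

# CORRECT: split first, then select features using the training rows only.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clean = SelectKBest(f_classif, k=20).fit(Xtr, ytr).get_support()
print("clean accuracy:", LogisticRegression(max_iter=1000).fit(Xtr[:, clean], ytr).score(Xte[:, clean], yte))
</code></pre><p>On pure noise, the leaky pipeline typically reports accuracy well above chance, while the correct one hovers around 50%. The only difference is whether feature selection ever saw the test rows: a one-line bug that lives in the code, not in the paper, which is why such errors survive peer review.</p>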
<p>Why is the scientific community so far behind software engineering best practices? In engineering applications, bugs are readily visible through tests or, in the worst case, when products are deployed to customers. Companies have strong incentives to fix errors to maintain the quality of their applications, or else they will lose market share. As a result, there is strong demand for software engineers with deep expertise in writing good software (and now, in using AI well). This is why software engineering practices in industry are decades ahead of those in research. In contrast, there are few incentives to correct flawed scientific results, and errors often persist for years.</p><p>That is not to say science should switch from a norms-based to a market-based model. But it shouldn't be surprising that there are many problems markets have solved that science hasn't &#8212; such as developing training pipelines for software engineers. Where such gaps between science and industry emerge, scientific institutions need to intentionally adopt industry best practices to ensure science continues to innovate, without losing what makes science special.</p><p>In short, science needs to catch up to six decades of software engineering &#8212; fast. Otherwise, its embrace of AI will lead to an avalanche of errors and create headwinds, not tailwinds, for progress.</p><p>AI could help too. There are many applications of AI for spotting errors. For example, the <a href="https://the-black-spatula-project.github.io/">Black Spatula project</a> and the <a href="https://www.nature.com/articles/d41586-025-00648-5">YesNoError project</a> use AI to uncover flaws in research papers. In our own work, we've <a href="https://arxiv.org/html/2409.11363v1">developed</a> benchmarks aiming to spur the development of AI agents that automatically reproduce papers. Given the utility of generative AI for writing code, AI itself could be used to improve researchers' software engineering practices, such as by providing feedback, suggestions, best practices, and code reviews at scale. If such tools become reliable and see widespread adoption, AI could be part of the solution, helping avoid wasted time and effort building on erroneous work. But all of these possibilities require interventions from journals, institutions, and funding agencies to incentivize training, synthesis, and error detection rather than production alone.</p><h3><strong>AI might prolong the reliance on flawed theories</strong></h3><p>One of the main uses of AI for science is modeling. Older modeling techniques required coming up with a hypothesis for how the world works, then using statistical models to make inferences about that hypothesis.</p><p>In contrast, AI-based modeling treats this process as a black box. Instead of positing a hypothesis about the world and improving our understanding based on the model's results, it simply tries to improve our ability to predict outcomes from past data.</p><p>Leo Breiman illustrated the difference between these two modeling approaches in his landmark paper "Statistical Modeling: The Two Cultures". He strongly advocated for AI-based modeling, largely on the basis of his experience in industry. A focus on predictive accuracy is no doubt helpful in industry. But it could hinder progress in science, where understanding is crucial.</p><p>Why? In a recent commentary in the journal <em><a href="https://www.nature.com/articles/d41586-025-01067-2">Nature</a></em>, we illustrated this with an analogy to the geocentric model of the Universe in astronomy. The geocentric model&#8212;the model of the Universe with the Earth at the center&#8212;was very accurate at predicting the motion of planets. Workarounds like "epicycles", the small circles added to planets' trajectories around the Earth, made these predictions accurate.</p><p>Whenever a discrepancy between the model's predictions and the experimental readings was observed, astronomers added an epicycle to improve the model's accuracy.
The geocentric model was so accurate at predicting planets' motions that many modern planetariums <em>still</em> use it to compute planets' trajectories.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/e450352c-e0ef-4c9a-8b6e-b69b18597941_1972x900.jpeg" alt="The geocentric model with epicycles (left) and the simpler heliocentric model (right)"><figcaption class="image-caption"><em><a href="https://upload.wikimedia.org/wikipedia/commons/0/0e/Cassini_apparent.jpg">Left</a>: The geocentric model of the Universe eventually became extremely complex due to the large number of epicycles. <a href="https://upload.wikimedia.org/wikipedia/commons/2/28/Copernican_heliocentrism_diagram-2.jpg">Right</a>: The heliocentric model was far simpler.</em></figcaption></figure></div><p>How was the geocentric model of the Universe overturned in favor of the heliocentric model &#8212; the model with the planets revolving around the Sun? The question couldn't be resolved by comparing the accuracy of the two models, since their accuracy was similar. Rather, the heliocentric model won because it offered a far simpler explanation for the motion of planets. In other words, advancing from geocentrism to heliocentrism required a <em>theoretical</em> advance, rather than simply a more accurate model.</p><p>This example shows that scientific progress depends on advances in <em>theory</em>. No amount of improvement in predictive accuracy could get us to the heliocentric model of the world without updating the theory of how planets move.</p><p>Let's come back to AI for science. AI-based modeling is no doubt helpful for improving predictive accuracy. But it doesn't lend itself to an improved understanding of the underlying phenomena. AI might be fantastic at producing the equivalents of epicycles across fields, leading to the <a href="https://psycnet.apa.org/record/2024-74071-002">prediction-explanation fallacy</a>.</p><p>In other words, if AI allows us to make better predictions from incorrect theories, researchers may keep relying on flawed theories for longer, slowing scientific progress. In the extreme case, fields would be stuck in an intellectual rut even as they excel at improving predictive accuracy within existing paradigms.</p><p>Could advances in AI help overcome this limitation? Maybe, but not without radical changes to modeling approaches and technology, and there is little incentive for the AI industry to innovate on this front.</p>
<p>So far, improvements in predictive accuracy have greatly outpaced improvements in the ability to model the underlying phenomena accurately.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/9ca01290-65cb-4d9e-8c92-fe7b65068498_800x344.gif" alt="Animation: a transformer predicts planetary orbits without learning the underlying gravitational laws"><figcaption class="image-caption"><em>Prediction without understanding: <a href="https://arxiv.org/abs/2507.06952">Vafa et al.</a> show that a transformer model trained on 10 million planetary orbits excels at predicting orbits without figuring out the underlying gravitational laws that produce those orbits.</em></figcaption></figure></div><h3><strong>Human understanding remains essential</strong></h3><p>In solving scientific problems, scientists build up an understanding of the phenomena they study. It might seem like this understanding is just a way to get to the solution, so if we can automate the process of going from problem to solution, we don&#8217;t need the intermediate step.</p><p>The reality is closer to the opposite. Solving problems and writing papers about them can be seen as a ritual that leads to the real prize, human understanding, without which there can be no scientific progress.</p><p>Fields Medal-winning mathematician William Thurston wrote an <a href="https://www.math.toronto.edu/mccann/199/thurston.pdf#page=1.27">essay</a> brilliantly illustrating this. At the outset, he emphasizes that the point of mathematics is not simply to figure out the truth value of mathematical facts, but rather the accompanying human understanding:</p><blockquote><p>&#8230;what [mathematicians] are doing is finding ways for <em>people</em> to understand and think about mathematics.</p><p>The rapid advance of computers has helped dramatize this point, because computers and people are very different. For instance, when Appel and Haken completed a proof of the 4-color map theorem using a massive automatic computation, it evoked much controversy. I interpret the controversy as having little to do with doubt people had as to the veracity of the theorem or the correctness of the proof. Rather, it reflected a continuing desire for <em>human understanding</em> of a proof, in addition to knowledge that the theorem is true.</p><p>On a more everyday level, it is common for people first starting to grapple with computers to make large-scale computations of things they might have done on a smaller scale by hand. They might print out a table of the first 10,000 primes, only to find that their printout isn't something they really wanted after all. They discover by this kind of experience that what they really want is usually not some collection of "answers"&#8212;what they want is <em>understanding</em>.
[emphasis in original]</p></blockquote><p>He then describes his experience as a graduate student working on the theory of foliations, then a center of attention among many mathematicians. Counterintuitively, after he published a series of papers proving the most important theorems in the field, people began to <em>leave</em> the field:</p><blockquote><p>I heard from a number of mathematicians that they were giving or receiving advice not to go into foliations&#8212;they were saying that Thurston was cleaning it out. People told me (not as a complaint, but as a compliment) that I was killing the field. Graduate students stopped studying foliations, and fairly soon, I turned to other interests as well.</p><p>I do not think that the evacuation occurred because the territory was intellectually exhausted&#8212;there were (and still are) many interesting questions that remain and that are probably approachable. Since those years, there have been interesting developments carried out by the few people who stayed in the field or who entered the field, and there have also been important developments in neighboring areas that I think would have been much accelerated had mathematicians continued to pursue foliation theory vigorously.</p><p>Today, I think there are few mathematicians who understand anything approaching the state of the art of foliations as it lived at that time, although there are some parts of the theory of foliations, including developments since that time, that are still thriving.</p></blockquote><p>Two things led to this desertion. First, the results he proved were written up in a way that was hard to understand, which discouraged newcomers from entering the field. Second, even though the point of mathematics is building up human understanding, the way mathematicians typically get credit for their work is by proving theorems. If the most prominent results in a field have already been proven, there are few incentives left for others to understand the field's contributions, because they can't prove further results (which is what would ultimately earn them credit).</p><p>In other words, researchers are incentivized to prove theorems. More generally, researchers across fields are incentivized to find&nbsp;<em>solutions</em>&nbsp;to scientific problems. But this incentive only leads to progress because the process of proving theorems or finding solutions <em>also</em> builds human understanding. As the desertion of work on foliations shows, when finding solutions comes apart from building human understanding, the result can be <em>slower</em> progress.</p><p>This is precisely the effect AI might have: by solving open research problems without producing the accompanying understanding, AI could erode these useful byproducts and reduce the incentives to build understanding at all. If we use AI to short circuit this process of understanding, that is like using a forklift at the gym.
You can lift heavier weights with it, sure, but that's not why you go to the gym.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/59680671-5a3d-4b1a-bb60-b04347bf5ab2_1600x690.jpeg" alt="Illustration: using AI to bypass understanding is like using a forklift at the gym"><figcaption class="image-caption"><em>AI could short circuit the process of building human understanding, which is essential to scientific progress</em></figcaption></figure></div><p>Of course, mathematics might be an extreme case, because human understanding <em>is</em> the end goal of (pure) mathematics, not simply knowing the truth value of mathematical statements. This might not be the case for many&nbsp;<em>applications</em>&nbsp;of science, where the end goal is to make progress towards a real-world outcome rather than human understanding, say, weather forecasting or materials synthesis.</p><p>Most fields lie in between these two extremes. If we use AI to bypass human understanding, or worse, retain only <a href="https://www.nature.com/articles/s41586-024-07146-0">illusions of understanding</a>, we might lose the ability to train new scientists, develop new theories and paradigms, synthesize and correct results, apply knowledge beyond science, or even generate new and interesting problems.</p><p>Empirical studies across scientific fields have found evidence of some of these effects. For example, Hao et al. <a href="https://arxiv.org/pdf/2412.07727">collect</a> data from six fields and find that papers that adopt AI are more likely to focus on providing solutions to known problems and working within existing paradigms than on generating new problems.</p><p>Of course, AI can also be used to build up tacit knowledge, such as by helping people understand mathematical proofs or other scientific knowledge. But this requires fundamental changes to how science is organized. Today's career incentives and social norms prize <em>solutions</em> to scientific problems over human understanding. As AI adoption accelerates, incentives need to change to make sure human understanding is prioritized.</p><h3><strong>Implications for the future of science</strong></h3><p>Over the last decade, scientists have been in a headlong rush to adopt AI. The speed has come at the expense of any ability to adapt slow-moving scientific institutional norms to maintain quality control and to identify and preserve what is essentially human about science.
As a result, the trend is likely to worsen the production-progress paradox, accelerating paper publishing while digging us deeper into the hole with regard to true scientific progress.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/0e953a80-d2ac-484d-a2c9-3bdf7f5edbe8_1584x1302.jpeg" alt="Growth in the number of papers using AI across 20 fields, 2012-2022"><figcaption class="image-caption"><em>The number of papers that use AI quadrupled across 20 fields between 2012 and 2022 &#8212; even before the adoption of large language models. <a href="https://arxiv.org/pdf/2405.15828">Figure by Duede et al.</a></em></figcaption></figure></div><p>So, what should the scientific community do differently? Let&#8217;s talk about the role of individual researchers, funders, publishers and other gatekeepers, and AI companies.</p><h4>Changing scientific practices</h4><p>Individual researchers should be more careful when adopting AI. They should build software engineering skills, learn how to avoid a long and growing list of <a href="https://www.science.org/doi/epdf/10.1126/sciadv.adk3452">pitfalls</a> in AI-based modeling, and ensure they don&#8217;t lose their expertise by using AI as a crutch or an oracle. Sloppy use of AI may help in the short run, but it will hinder meaningful scientific achievement.</p><p>With all that said, we recognize that most individual researchers are rationally following their incentives (productivity metrics). Yelling at them is not going to help much, because what we have is a collective action problem. The actors with real power to effect change are journals, university hiring and promotion committees, funders, policymakers, and so on. Let&#8217;s turn to those next.</p><h4>Investing in meta-science</h4><p>Meta-science research has been extremely valuable in revealing the production-progress paradox. But so far, that finding doesn&#8217;t have a lot of analytical precision. There&#8217;s only the fuzzy idea that science is getting less bang for its buck. This finding is generally consistent with scientists&#8217; vibes, and is backed by a bunch of different metrics that each try, imperfectly, to measure true progress.
But we don&#8217;t have a clear understanding of what the construct (progress) even is, and we&#8217;re far from a consensus story about what&#8217;s driving the slowdown.</p><p>To be clear, we will never have One True Progress Metric. If we did, Goodhart&#8217;s/Campbell&#8217;s law would kick in &#8212; &#8220;When a measure becomes a target, it ceases to be a good measure.&#8221; Scientists would start to furiously optimize it, just as we have done with publication and citation counts, and the gaming would render it useless as a way to track progress.</p><p>That said, there&#8217;s clearly a long way for meta-science to go in improving both our quantitative and (more importantly) our qualitative/causal understanding of progress and the slowdown. Meta-science must also work to understand the efficacy of solutions.</p><p><a href="https://www.ukri.org/what-we-do/browse-our-areas-of-investment-and-support/uk-metascience-unit/">Despite</a> <a href="https://www.nature.com/articles/d41586-025-02065-0">recent</a> <a href="https://researchonresearch.org/metascience-alliance-launches-at-metascience-2025-conference/">growth</a>, meta-science funding is a fraction of a percent of science funding (and research on the slowdown is only a fraction of <em>that</em> pie). If it is indeed true that science funding as a whole is getting orders of magnitude less bang for the buck than in the past, meta-science investment seems woefully small.</p><h4>Reforming incentives</h4><p>Scientists constantly complain to each other about the publish-or-perish treadmill and are keenly aware that the production-focused reward structure isn&#8217;t great for incentivizing scientific progress. But efforts to change this have consistently failed. One reason is simple inertia. Then there&#8217;s the aforementioned Goodhart&#8217;s law &#8212; whatever new metric is instituted will quickly be gamed. A final difficulty is that true progress can only be identified retrospectively, on timescales that aren&#8217;t suitable for hiring and promotion decisions.</p><p>One silver lining is that as the cost of publishing papers drops further due to AI, it could force us to stop relying on production metrics. In the AI field itself, the effort required to write a paper is so low that we are heading towards a singularity, with some researchers being able to (co-)author close to <a href="https://arxiv.org/pdf/2412.07793">100 papers a year</a>. (But, again, the perceived pace of actual progress seems <a href="https://x.com/jxmnop/status/1931386226809971138">mostly flat</a>.) Other fields might start going the same route.</p><p>Thus, rewarding the publication of individual findings may simply not be an option for much longer. Instead, perhaps the kinds of papers that count toward career progress should be limited to things that are hard to automate, such as new theories or paradigms of scientific research. And such reforms to incentive structures should go hand-in-hand with shifts in funding.</p><p>One thing we <em>don&#8217;t</em> need is more incentives for AI adoption. As we explained above, it is already happening at breakneck speed, and is not the bottleneck.</p><h4>Rethinking AI-for-science tools</h4><p>When it comes to AI-for-science labs and scientific tools from big AI companies, the elephant in the room is that they are in it for the wrong reasons. They want flashy &#8220;AI discovers X!&#8221; headlines so that they can sustain the narrative that AI will solve humanity&#8217;s problems, which buys them favorable policy treatment.
We are not holding our breath for this to change.</p><p>We should be skeptical of AI-for-science news headlines. Many of them are <a href="https://www.aisnakeoil.com/p/scientists-should-use-ai-as-a-tool">greatly exaggerated</a>. The results may fail to reproduce, or AI may be framed as the main character when it was in fact one tool among many.</p><p>If there are any AI-for-science tool developers out there who actually want to help, here&#8217;s our advice. Target the actual bottlenecks instead of building yet another literature review tool. How about tools for finding errors in scientific code or other forms of quality control? Listen to the users. For example, mathematicians have repeatedly said that tools for improving human understanding are much more exciting than trying to automate theorem-proving, which they view as missing the point.</p><p>The way we evaluate AI-for-science tools should also change. Consider a literature review tool. There are three kinds of questions one can ask: Does it save a researcher time and produce results of comparable quality to existing tools? How does the use of the tool impact the researcher&#8217;s understanding of the literature compared to traditional search? What would the collective impacts on the community be if the tool were widely adopted &#8212;&nbsp;for example, would it further focus the community&#8217;s attention on already-famous papers? These three questions get at production, understanding, and progress, respectively.</p><p>Currently, only the first question is considered part of what evaluation means. The latter two are out of scope, and there aren&#8217;t even established methods or metrics for such measurement. That means that AI-for-science evaluation is guaranteed to provide a highly incomplete and biased picture of the usefulness of these tools, one that understates their potential harms.</p><h3><strong>Final thoughts</strong></h3><p>We ourselves are enthusiastic users of AI in our scientific workflows. On a day-to-day basis, it all feels very exciting. That makes it easy to forget that the impact of AI on science as an institution, rather than individual scientists, is a different question that demands a different kind of analysis. Writing this essay required fighting our own intuitions in many cases. If you are a scientist who is similarly excited about using these tools, we urge you to keep this difference in mind.</p><p>Our skepticism here has similarities to, and differences from, the reasons for the slow timelines we laid out in <a href="https://knightcolumbia.org/content/ai-as-normal-technology">AI as Normal Technology</a>. In that paper, we explained that market mechanisms exert some degree of quality control, and many shoddy AI deployments have <a href="https://www.cio.com/article/190888/5-famous-analytics-and-ai-disasters.html">failed badly</a>, forcing companies who care about their reputation to <a href="https://www.linkedin.com/posts/randomwalker_for-ai-to-have-rapid-transformative-economic-activity-7348681016900833281-sNDF?rcm=ACoAAD85EvsBeNijbxJLxmNZcA4cF5Gc0JEwrc4">take it slow</a> when <em>deploying</em> AI, especially for consequential tasks, regardless of how fast the pace of development is. But in science, adoption and quality control processes are decoupled, with the former being much faster.
But for now, it&#8217;s going to be a bumpy ride.</p><p><em>We are grateful to Eamon Duede for feedback on a draft of this essay.</em></p><h3><strong>Further reading</strong></h3><ul><li><p>The <a href="https://www.heinrich.senate.gov/ASAP">American Science Acceleration Project (ASAP)</a> is a national initiative with the stated goal of making American science &#8220;ten times faster by 2030&#8221;. The offices of Senators Heinrich and Rounds recently requested feedback on how to achieve this. In our <a href="https://www.cs.princeton.edu/~sayashk/asap-rfc-response.pdf">response</a>, we emphasized the production-progress paradox, discussed why AI could slow (rather than hasten) scientific progress, and recommended policy interventions that could help.</p></li><li><p>Our colleague Alondra Nelson also wrote a <a href="https://www.science.org/doi/10.1126/science.adz9545">response</a> to the ASAP initiative, emphasizing that faster science is not automatically better, and highlighting many challenges that remain even as the pace of production increases.</p></li><li><p>In a recent <a href="https://www.nature.com/articles/d41586-025-01067-2">commentary</a> in the journal <em>Nature</em>, we discussed why the proliferation of AI-driven modeling could be bad for science.</p></li><li><p>We have written about the use of AI for science in many previous essays in this newsletter:</p><ul><li><p><a href="https://www.aisnakeoil.com/p/can-ai-automate-computational-reproducibility">Can AI automate computational reproducibility?</a></p></li><li><p><a href="https://www.aisnakeoil.com/p/scientists-should-use-ai-as-a-tool">Scientists should use AI as a tool, not an oracle</a></p></li><li><p><a href="https://www.aisnakeoil.com/p/machine-learning-is-useful-for-many">ML is useful for many things, but not for predicting scientific replicability</a></p></li><li><p><a href="https://www.aisnakeoil.com/p/introducing-the-reforms-checklist">The REFORMS checklist for detecting and preventing errors in ML-based science</a></p></li></ul></li><li><p>Lisa Messeri and Molly Crockett <a href="https://www.nature.com/articles/s41586-024-07146-0">offer</a> a taxonomy of the uses of AI in science. They discuss many pitfalls of adopting AI in science, arguing we could end up producing more while understanding less.</p></li><li><p>Matt Clancy reviewed the evidence for slowdowns in <a href="https://mattsclancy.substack.com/p/science-is-getting-harder">science</a> and <a href="https://www.newthingsunderthesun.com/pub/bvmu4ol2/release/10">innovation</a>, and <a href="https://www.newthingsunderthesun.com/pub/oobot2tp/release/11">discussed</a> interventions for incentivizing genuine progress.</p></li><li><p>The Institute for Progress released a podcast series on meta-science. Among other things, the series discusses concerns about <a href="https://ifp.org/the-metascience-101-podcast-series/#episode-two-is-science-slowing-down">slowdown</a> and <a href="https://ifp.org/the-metascience-101-podcast-series/">alternative models</a> for funding and organizing science.</p></li></ul><p><em>Update (July 17, 2025): Minor wording edits for clarity.
</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Adjusting for <a href="https://www.in2013dollars.com/us/inflation/2000?endYear=2021&amp;amount=1">inflation</a>, this is still a 2.5x increase.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This <a href="https://www.nature.com/articles/d41586-025-01548-4">article</a> in Nature News, published two years after the original study, documents the controversy surrounding the paper's results and responses from the authors.</p></div></div>]]></content:encoded></item><item><title><![CDATA[AGI is not a milestone]]></title><description><![CDATA[There is no capability threshold that will lead to sudden impacts]]></description><link>https://www.normaltech.ai/p/agi-is-not-a-milestone</link><guid isPermaLink="false">https://www.normaltech.ai/p/agi-is-not-a-milestone</guid><dc:creator><![CDATA[Sayash Kapoor]]></dc:creator><pubDate>Thu, 01 May 2025 11:47:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!LOkG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49daa6d8-fe58-410c-b060-83648d15a722_640x480.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With the release of OpenAI&#8217;s latest model o3, there is renewed debate about whether Artificial General Intelligence has already been achieved. The standard skeptic&#8217;s response to this is that there is no consensus on the definition of AGI. That is true, but misses the point &#8212; if AGI is such a momentous milestone, shouldn&#8217;t it be obvious when it has been built?</p><p>In this essay, we argue that AGI is not a milestone. It does not represent a discontinuity in the properties or impacts of AI systems. If a company declares that it has built AGI, based on whatever definition, it is not an actionable event. It will have no implications for businesses, developers, policymakers, or safety. Specifically:</p><ol><li><p>Even if general-purpose AI systems reach some agreed-upon capability threshold, we will need many complementary innovations that allow AI to diffuse across industries to realize its productive impact. Diffusion occurs at human (and societal) timescales, not at the speed of tech development.</p></li><li><p>Worries about AGI and catastrophic risk often conflate capabilities with power. Once we distinguish between the two, we can reject the idea of a critical point in AI development at which it becomes infeasible for humanity to remain in control.</p></li><li><p>The proliferation of AGI <a href="https://define-agi.com/">definitions</a> is a symptom, not the disease. AGI is significant because of its presumed impacts but must be defined based on properties of the AI system itself. But the link between system properties and impacts is tenuous, and greatly depends on how we design the environment in which AI systems operate. Thus, whether or not a given AI system will go on to have transformative impacts is <em>yet to be determined</em> at the moment the system is released. 
So a determination that an AI system constitutes AGI can only meaningfully be made <em>retrospectively</em>.</p></li></ol><h2><strong>Nuclear weapons as an anti-analogy for AGI</strong></h2><p>Achieving AGI is the explicit goal of companies like OpenAI and much of the AI research community. It is treated as a milestone in the same way that building and delivering a nuclear weapon was the key goal of the Manhattan Project.</p><p>This goal made sense as a milestone in the Manhattan Project for two reasons. The first is <em>observability</em>. In developing nuclear weapons, there can be no doubt about whether you&#8217;ve reached the goal or not &#8212; an explosion epitomizes observability. The second is <em>immediate impact</em>. The use of nuclear weapons contributed to a quick end to World War 2. It also ushered in a new world order &#8212; a <em>long-term transformation</em> of geopolitics.</p><p>Many people have the intuition that AGI will have these properties. It will be so powerful and humanlike that it will be obvious when we&#8217;ve built it. And it will immediately bring massive benefits and risks &#8212; automation of a big swath of the economy, a great acceleration of innovation, including AI research itself, and potentially catastrophic consequences for humanity from uncontrollable superintelligence.</p><p>In this essay, we argue that AGI will be exactly the opposite &#8212; it is unobservable because there is no clear capability threshold that has particular significance; it will have no immediate impact on the world; and even a long-term transformation of the economy is uncertain.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!LOkG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49daa6d8-fe58-410c-b060-83648d15a722_640x480.png" width="640" height="480" alt=""></figure></div><p>In previous essays, we have argued against the likely disastrous <a href="https://www.aisnakeoil.com/p/licensing-is-neither-feasible-nor">policy interventions</a> that some have recommended by analogizing AGI to nuclear weapons.
It is striking to us that this analogy reliably generates what we consider to be incorrect predictions and counterproductive recommendations.</p><h2><strong>It isn&#8217;t crazy to think that o3 is AGI, but this says more about AGI than o3</strong></h2><p>Many <a href="https://marginalrevolution.com/marginalrevolution/2025/04/o3-and-agi-is-april-16th-agi-day.html">prominent</a> <a href="https://www.oneusefulthing.org/p/on-jagged-agi-o3-gemini-25-and-everything">AI commentators</a> have called o3 a kind of AGI: Tyler Cowen says that if you know AGI when you see it, then he has <a href="https://marginalrevolution.com/marginalrevolution/2025/04/o3-and-agi-is-april-16th-agi-day.html">seen it</a>. Ethan Mollick describes o3 as a <a href="https://www.oneusefulthing.org/p/on-jagged-agi-o3-gemini-25-and-everything">jagged AGI</a>. What is it about o3 that has led to such excitement?</p><p>The key innovation in o3 is the use of reinforcement learning to learn to search the web and use tools as part of its reasoning chain.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> In this way, it can perform more complex cognitive tasks than LLMs are directly capable of, and can do so in a way that&#8217;s similar to people.</p><p>Consider a person doing comparison shopping. They might look at a few products, use the reviews of those products to get a better sense of what features are even important, and use that knowledge to iteratively expand or shrink the set of products being considered. o3 is a generalist agent that does a decent job at this sort of thing.</p><p>Let&#8217;s consider what this means for AGI. To avoid getting bogged down in the details of o3, imagine a future system whose architecture is identical to o3, but is much more competent. For example, it can always find the right webpages and knowledge for the task as long as it&#8217;s online, no matter how hard it is to locate. It can download and run code from the internet to solve a task if necessary. None of these require scientific breakthroughs, only engineering improvements and further training.</p><p>At the same time, without scientific improvements, the architecture imposes serious limits. For example, this future system cannot acquire new skills from experience, except through an explicit update to its training. Building AI systems that can learn on the fly is an open research problem.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>Would our hypothetical system be AGI? Arguably, yes. What many AGI definitions have in common is the ability to outperform humans at a wide variety of tasks. Depending on how narrowly the set of tasks is defined and how broadly the relevant set of humans for each task is defined, it is quite plausible that this future o3-like model/agent will meet some of these AGI definitions.</p><p>For example, it will be superhuman at playing chess, despite the fact that large language models themselves are at best <a href="https://maxim-saplin.github.io/llm_chess/">mediocre</a> <a href="https://dynomight.substack.com/p/more-chess">at</a> <a href="https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/">chess</a>. Remember that the model can use tools, search the internet, and download and run code. If the task is to play chess, it will download and run a chess engine.</p>
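<p>To make the mechanics concrete, here is a minimal sketch, in Python, of the kind of tool-use loop described above (and in footnote 1). This is our own illustration under simplifying assumptions, not OpenAI&#8217;s implementation: the &#8220;model&#8221; and the tool below are stubs we made up, and a real system would be a trained model choosing among many tools over many steps. The point is only the shape of the loop: at each step the model emits either a tool call or a final answer, and the harness executes tool calls and feeds the results back into the model&#8217;s context.</p><pre><code># A minimal sketch of an agentic tool-use loop (illustrative only).
# "model_step" is a stub standing in for a trained model like o3,
# which decides at each step whether to call a tool or to answer.

def model_step(transcript: str) -> str:
    """Stub model: calls a tool once, then answers. A real model is
    trained (e.g., with RL) to make this choice at every step."""
    if "RESULT:" not in transcript:
        return 'TOOL: search("strongest open-source chess engine")'
    return "ANSWER: Delegate play to a chess engine such as Stockfish."

def run_tool(tool_call: str) -> str:
    """Stub tool executor: a real harness would search the web, run
    code, or read files, and return the output as text."""
    return "RESULT: Stockfish is a strong open-source chess engine."

def agent_loop(task: str, max_steps: int = 5) -> str:
    transcript = f"TASK: {task}"
    for _ in range(max_steps):
        output = model_step(transcript)
        if output.startswith("TOOL:"):
            # Feed the tool result back into the model's context.
            transcript += f"\n{output}\n{run_tool(output)}"
        else:
            return output  # the model produced a final answer
    return "No answer within the step budget."

print(agent_loop("Play chess at a superhuman level"))
</code></pre><p>Note that the harness, not the model, decides which tools exist and what they are permitted to do. This is one concrete sense in which, as we discuss below, the power afforded to an AI system is a property of its environment rather than of its capabilities.</p>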
<p>Despite human-level or superhuman performance at many tasks, and plausibly satisfying some definitions of AGI, it will probably fail badly at many <a href="https://www.businessinsider.com/ai-agents-study-company-run-by-ai-disaster-replace-jobs-2025-4">real-world</a> tasks. We&#8217;ll get back to the reasons for that.</p><p>Does any of this matter? It does. Leaders at AI companies have made very loud <a href="https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/">predictions</a> <a href="https://www.reuters.com/technology/teslas-musk-predicts-ai-will-be-smarter-than-smartest-human-next-year-2024-04-08/">and</a> <a href="https://arstechnica.com/ai/2025/01/anthropic-chief-says-ai-could-surpass-almost-all-humans-at-almost-everything-shortly-after-2027/">commitments</a> to <a href="https://arstechnica.com/information-technology/2025/01/sam-altman-says-we-are-now-confident-we-know-how-to-build-agi/">delivering</a> AGI within a <a href="https://www.businessinsider.com/agi-predictions-sam-altman-dario-amodei-geoffrey-hinton-demis-hassabis-2024-11#demis-hassabis-1">few years</a>. There are enormous incentives for them to declare some near-future system to be AGI, and potentially enormous costs to <em>not</em> doing so. Some of the valuation of AI companies may rest on these promises, so a failure to deliver AGI might burst the bubble. Being seen as leaders in AI development could also help improve market share, revenues, and access to <a href="https://www.theverge.com/2024/1/18/24042354/mark-zuckerberg-meta-agi-reorg-interview">talent</a>.</p><p>So, if and when companies claim to have built AGI, what will be the consequences? We&#8217;ll analyze that in the rest of this essay.</p><h2><strong>AGI won&#8217;t be a shock to the economy because diffusion takes decades</strong></h2><p>One argument for treating AGI as a milestone &#8212; and taking declarations of AGI seriously &#8212; is that AGI could lead to rapid economic impacts, both positive and negative, such as a world without scarcity, <a href="https://www.politico.com/newsletters/digital-future-daily/2024/05/14/altmans-fantasy-of-a-gpt-centric-world-00157966">an end to the concept of money</a>, or <a href="https://epoch.ai/gradient-updates/agi-could-drive-wages-below-subsistence-level">sudden mass joblessness</a>.</p><p>But AI&#8217;s economic impact is only realized when it is adopted across the economy. Technical advances are necessary, but not sufficient, to realize this impact. For past general-purpose technologies, such as electricity, computing, and the internet, it took decades for the underlying technical advances to diffuse across society.
The miracle of the Industrial Revolution wasn&#8217;t the high growth rate &#8212; annual growth rates averaged below <a href="https://www.bloomberg.com/opinion/articles/2023-08-16/ai-won-t-supercharge-the-us-economy">3%</a> &#8212; but the sustained period of decades of growth.</p><p>There are many <a href="https://www.aisnakeoil.com/p/ai-as-normal-technology">bottlenecks</a> to the diffusion of AI: developing useful <a href="https://www.aisnakeoil.com/p/ai-companies-are-pivoting-from-creating">products and applications</a>, training the workforce to utilize these products, implementing organizational changes to enable AI use, and establishing laws and norms that facilitate AI adoption by companies. Like past general-purpose technologies, we expect the economic impacts of AI to be realized over decades, as this process of diffusion unfolds.</p><p>In the paper <a href="https://www.aisnakeoil.com/p/ai-as-normal-technology">AI as Normal Technology</a>, we present a detailed argument for why we think this will be the case. The idea that rapid increases in capability lead to rapid economic impacts is completely inconsistent with the past and present of AI, and there is no reason to expect that this will change in the future.</p><p>One <a href="https://web.archive.org/web/20180409161852/https://blog.openai.com/openai-charter/">definition</a> of AGI is AI systems that outperform humans at most economically valuable work. We might worry that if AGI is realized in this sense of the term, it might lead to massive, sudden job displacement. But humans are a moving target. As the process of diffusion unfolds and the cost of production (and hence the value) of tasks that have been automated decreases, humans will adapt and move to tasks that have not yet been automated. The process of technical advancements, product development, and diffusion will continue.</p><h2><strong>AGI will not lead to a rapid change in the world order</strong></h2><p>The US and China are often described as being in an AI arms race, with each country racing to build AGI. It is hypothesized that the country that builds it first would have a decisive strategic advantage &#8212; resulting in dominance in the world order for the foreseeable future.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>This narrative doesn&#8217;t make sense because the knowledge required to create AI models, and model capabilities themselves, tend to proliferate quickly between countries. There are hundreds of thousands of AI technologists, and they work in the private sector rather than government labs, so it is not feasible to keep secrets at that scale.</p><p>Invention &#8212; in this case, AI model development &#8212; is <a href="https://www.foreignaffairs.com/china/innovation-fallacy-artificial-intelligence">overrated</a> as a source of competitive advantage. We should expect technological developments to roughly keep pace across countries. Even though US companies are currently at the forefront, we shouldn&#8217;t expect a lasting advantage.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>Many people haven&#8217;t appreciated the ease of proliferation of technological capabilities &#8212; perhaps due to the nuclear weapons mental model &#8212; and have been caught by surprise.
This is what led to the "DeepSeek moment" earlier this year, since analysts did not realize how quickly AI capabilities can proliferate, and as a result, were not expecting upstarts (especially from China) to catch up so quickly.</p><p>Some people argue that even an advantage of a <a href="https://blog.ai-futures.org/p/why-america-wins">few months</a> will be critical. We disagree. The important question in the context of great power competition is not which country builds AGI first, but rather which country better enables diffusion. As Jeffrey Ding has <a href="https://press.princeton.edu/books/paperback/9780691260341/technology-and-the-rise-of-great-powers">shown</a>, the effectiveness of companies and governments in actually utilizing AI inventions and innovations, whether domestic or international, to improve productivity is far more important for determining the economic impacts of general-purpose technologies.</p><p>While Chinese AI companies are at most 6-12 months behind leading US companies in terms of AI models and capabilities, China <a href="https://www.prcleader.org/post/china-s-ai-implementation-gap">lags</a> significantly behind the US in several key indicators that might enable diffusion: Digitization, cloud computing adoption, and workforce training. All of these are required to enable the productive diffusion of AI advances across industries. This is the actual source of American competitive advantage.</p><p>Of course, this could change in the coming years. But if it does, it will result from policy changes to promote diffusion rather than the development of AGI. And it is not something countries can accomplish overnight, regardless of how quickly they change policy. Diffusion typically unfolds over decades.</p><p>None of this means that policymakers should be complacent. But it does mean that rather than fixating on AGI, they should focus on enabling productive and safe diffusion, including of <em>existing</em> AI.</p><h2><strong>The long-term economic implications of AGI are uncertain</strong></h2><p>Even if it doesn&#8217;t have immediate economic impacts, could AGI unlock, say, <a href="https://www.dwarkesh.com/p/satya-nadella">10% annual GDP growth</a> that could add up to something big over a few decades?</p><p>Maybe. But it is far from clear why and how this will happen.</p><p>Historically, this kind of acceleration in growth has happened very few times &#8212; the industrial revolution had this effect, but not the internet, which barely had any <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/joes.12211">impact</a> on GDP. Note that even if you don&#8217;t think that GDP is the right thing to measure, a qualitative change in the GDP growth rate is a good proxy for whatever fundamental change in the economy you care about.</p><p>The problem is that accelerating growth requires eliminating bottlenecks to progress. That&#8217;s harder than most AI boosters assume. AI will likely have uneven effects across sectors, and long-term growth will be bottlenecked by the <em><a href="https://zhengdongwang.com/2023/06/27/why-transformative-ai-is-really-really-hard-to-achieve.html">weakest</a></em> sector.</p><p>Those arguing for dramatic effects often have an incorrect mental model of what the bottlenecks actually are. 
For example, while it is tempting to believe that cheap scientific innovation will unlock progress, the production of new findings is actually <a href="https://substack.com/@aisnakeoil/note/c-92421948">not the</a> <a href="https://www.nature.com/articles/d41586-025-01067-2">bottleneck</a> in science.</p><p>More broadly, progress depends not just on the technology but on having the right preconditions &#8212; complementary innovations as well as cultural, economic, and political factors. If all it took to create the Industrial Revolution was the invention of steam power, the <a href="https://acoup.blog/2022/08/26/collections-why-no-roman-industrial-revolution/">Roman</a> <a href="https://blog.rootsofprogress.org/why-no-roman-industrial-revolution">Empire</a> would have done it.</p><p>Our current laws, norms, institutions, and politics evolved in a time of much less technological potential. They are already <a href="https://www.noahpinion.blog/p/book-review-abundance">choking</a> <a href="https://www.perseusbooks.com/titles/marc-j-dunkelman/why-nothing-works/9781541700215/">opportunities</a> for straightforward types of growth, such as building more public infrastructure. To reap the economic benefits that broad cognitive automation can potentially bring, the degree of structural change that needs to happen is unfathomably greater.</p><p>In conclusion, the extent and nature of long-term impacts from AGI remain to be seen and depend on what complementary actions we take. The long-term implications are not a property of AGI itself.</p><h2><strong>Misalignment risks of AGI conflate power and capability</strong></h2><p>On the flip side, AGI could be a turning point for AI&#8217;s societal risks. Could it cause loss of control, massive societal harm, or even human extinction?</p><p>Discussions of AGI risks conflate power &#8212; the ability to modify the environment &#8212; with capability &#8212; the capacity to solve specified tasks correctly. Capability is an intrinsic property of an AI system, whereas power is a matter of how we design the environment in which AI systems operate. And humans have agency over this design. This distinction is often overlooked.</p><p>Consider Dario Amodei&#8217;s definition of &#8220;powerful AI&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> He begins with a description of the <em>capabilities</em> of powerful AI, such as being able to &#8220;...prove unsolved mathematical theorems, write extremely good novels, write difficult codebases from scratch.&#8221; These criteria describe AI capabilities, which we can discuss at the level of the AI system itself.</p><p>But he then moves on to describe properties of the environment in which we allow AI systems to operate, which includes &#8220;...taking actions on the internet, taking or giving directions to humans, ordering materials, directing experiments, watching videos, making videos, and so on.&#8221; This is an example of the power afforded to an AI system. It depends on the environment within which the AI system operates, which determines how AI capabilities are translated into power.</p><p>We do expect AI capabilities to keep increasing. But regardless of capability level, we can choose to ensure that AI remains a tool and is not given power and autonomy to operate without human oversight.
In the <a href="https://www.aisnakeoil.com/p/ai-as-normal-technology">AI as Normal Technology</a> essay, we address all the usual counterarguments to this, including arms races among companies, power seeking, superhuman persuasion, deceptive alignment, and more.</p><p>We argue in the paper that there will be strong business incentives against deploying AI without adequate oversight, and that these incentives can and should be buttressed by regulation when necessary. This has historically been the case in areas ranging from self-driving cars to AI assistants. We don&#8217;t expect this trend to suddenly flip once AI capabilities reach a presumed tipping point that we arbitrarily designate as AGI.</p><h2><strong>AGI does not imply impending superintelligence</strong></h2><p>Yet another reason to consider AGI a milestone is the view that shortly after we build AGI, AI systems could recursively self-improve &#8212; AGI could train future versions of models that become far more capable, leading to an &#8220;intelligence explosion.&#8221; Soon afterwards, we would get superintelligent AI (AI systems that far exceed human abilities on any conceivable task), leading to either utopia or dystopia, depending on how well superintelligent AI is &#8220;aligned&#8221; with human interests.</p><p>In the normal technology view, there are two big reasons to doubt this narrative. The first is that <em>even if</em> arbitrary speedups in AI methods are possible, we think that innovation and diffusion will happen at human speed, as summarized in this table.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!TaBP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9125e057-7253-41f2-ba37-2cd6b05026cf_1456x760.png" width="1456" height="760" alt=""><figcaption class="image-caption"><em>Like other general-purpose technologies, the impact of AI is materialized not when methods and capabilities improve, but when those improvements are translated into applications and are diffused through productive sectors of the economy. [<a href="https://www.aisnakeoil.com/p/ai-as-normal-technology">Source</a>]</em></figcaption></figure></div><p>Second, the fact that AI would help conduct AI research does not imply that this process can be arbitrarily accelerated. AI is already used to automate a significant portion of AI research today.
But there are many bottlenecks to progress in AI methods, such as the social nature of data collection and real-world interaction that might be required for achieving certain capabilities, computational and cost limits, or herding around popular or intuitive ideas while ignoring the ones that enable true breakthroughs.</p><p>We could be wrong about this, and recursive self-improvement could be possible, leading to unbounded speedups in progress in AI methods. And this might have some interesting implications, including some discontinuities in impact, even if widespread diffusion will be slower. For these reasons, it is important to have early warning systems for recursive self-improvement. However, this is not captured by AGI definitions. We might well have AGI while being far from recursive self-improvement, or vice versa.</p><h2><strong>We won&#8217;t know when AGI has been built</strong></h2><p>There are endless definitions of AGI and related concepts &#8212; Jasmine Sun has a useful <a href="https://define-agi.com/">compilation</a> of over 20 definitions &#8212; but they fall into three broad categories: they can be based on the system&#8217;s <em>impact </em>on the world, its <em>internals</em>, or its <em>behavior </em>in controlled settings. We will show that each style of definition has a fatal flaw. It leads to criteria that are either too strict or too weak compared to what we ideally want.</p><p>Understanding these gaps also shows why people have different intuitions regarding what AGI will look like, and why &#8220;we&#8217;ll know it when we see it&#8221; has failed as a criterion and will continue to fail.</p><p>OpenAI's 2018 <a href="https://web.archive.org/web/20180409161852/https://blog.openai.com/openai-charter/">definition</a> of AGI was "highly autonomous systems that outperform humans at most economically valuable work". From our perspective &#8212; our interest being in the impacts of AI &#8212; this definition is potentially very useful. If AI outperformed [all] humans at most economically valuable work, it would be unquestionably impactful.</p><p>But let&#8217;s be clear &#8212; this is not a property of an AI system. It is a property of the state of the world. It has at least as much to do with the complementary innovations that we make and the extent to which we choose to integrate AI into our organizations and institutions. It would be absurd to try to test an AI system in isolation in the lab and ask whether it outperforms people at their jobs. It is a category error.</p><p>For example, whether AI can (autonomously) outperform a medical researcher depends in part on whether we collectively choose to allow AI systems to perform large-scale medical experiments on people. We shouldn&#8217;t and we won&#8217;t, which means that irrespective of the systems' capabilities, they cannot perform the function of a medical researcher. This might be an extreme example, but similar bottlenecks arise in virtually every job.</p><p>Worse, until and unless we diffuse AI everywhere, we won&#8217;t even know if the system is <em>theoretically</em> capable of automating work in the real world. We won&#8217;t be able to set up sufficiently convincing simulacra of the messy complexity of the world. 
In short, impact-based definitions are not useful for practical purposes because they don&#8217;t give us a way to anticipate the end result of the painfully slow process of diffusion.</p><p>In contrast to researchers like us who are interested in impacts, many researchers are interested in humanlike AI in the sense of internals &#8212; does the system truly understand the world causally, can it reason, plan, and acquire new skills like we do, and so forth.</p><p>This sense of AGI has been notoriously hard to operationalize because of the difficulty of observing and characterizing AI internals. The Turing test is the best known of many attempts to use behavior as a proxy for the humanlike attributes we care about, but <a href="https://www.science.org/doi/10.1126/science.adq9356">inevitably</a> it turns out that we can build AI systems that pass such tests without having the hoped-for humanlike internals.</p><p>Furthermore, because of the <a href="https://www.oneusefulthing.org/p/on-jagged-agi-o3-gemini-25-and-everything">jaggedness</a> of AI &#8212; superhuman in many ways, yet lacking a toddler&#8217;s understanding of the world in other ways &#8212; the transformative effects of AI are likely to be felt long before it is fully humanlike (or superhuman) on all dimensions.</p><p>In short, since we are not interested in the internals for their own sake, we set aside this kind of definition.</p><p>That leaves us with the third kind of definition, which is by far the most common: those based on behavior, operationalized as benchmark performance. For example, the Metaculus question on &#8220;<a href="https://www.metaculus.com/questions/384/humanmachine-intelligence-parity-by-2040/">human-machine intelligence parity</a>&#8221; is defined in terms of performance on exam questions in math, physics, and computer science. The problem with this kind of definition is well known, and we&#8217;ve <a href="https://www.aisnakeoil.com/i/161317202/benchmarks-do-not-measure-real-world-utility">discussed</a> it <a href="https://www.aisnakeoil.com/p/ai-existential-risk-probabilities">repeatedly</a>. 
Such definitions simply encourage hill climbing: building AI systems that can beat the benchmarks without necessarily being useful in the real world.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!u-7B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd58c76e9-c5f3-4cfd-8b0f-f9aa20872c51_1600x539.png" width="1456" height="490" alt=""><figcaption class="image-caption"><em>Pros and cons of the three kinds of AGI definitions</em></figcaption></figure></div><p>One response to the definitional challenge is to claim that we&#8217;ll know AGI when we see it. The o3 launch shows that the opposite is true &#8212; to some people, it is clear that the advances in capabilities represent a step change that warrants calling it AGI. To others, the improvements are marginal at best, and unlikely to lead to real-world impact.</p><p>What explains people&#8217;s differing intuitions? Here&#8217;s our guess. While AI capabilities can be general, making AI useful in the real world will have to happen in a largely domain-specific way. (The generalist agent aspect of o3 is misleading &#8212; it can handle generative tasks where errors have low cost, but not tasks that require it to operate independently in the real world. For example, a useful travel booking AI agent hasn&#8217;t yet been released, despite the seeming triviality of the task for humans.)</p><p>So when people are thinking about whether o3 or any other system is (close to) AGI, they are intuitively thinking about different domains, and the gap between general capabilities and useful real-world abilities is vastly different from one domain to another. Crossing the threshold of having useful products that can automate most tasks may happen at widely different times in different sectors or jobs.</p><h2><strong>Businesses and policymakers should take a long-term view</strong></h2><p>AGI is not a milestone because it is not actionable. A company declaring it has achieved, or is about to achieve, AGI has no implications for how businesses should plan, what safety interventions we need, or how policymakers should react. What should businesses and policymakers do instead?</p><p>Businesses should not rush to adopt half-baked AI products. Rapid progress in AI methods and capabilities does not automatically translate to better products. Building products on top of inherently stochastic models is <a href="https://www.aisnakeoil.com/p/ai-companies-are-pivoting-from-creating">challenging</a>, and businesses should adopt AI products cautiously, conducting careful experiments to determine the impact of using AI to automate key business processes.
In particular, we are extremely skeptical of the idea of AI agents being &#8220;drop-in replacements&#8221; for human workers that will somehow bypass the need for careful evaluation and integration of automation into workflows.</p><p>Companies developing AI products need a deep understanding of the domain to identify hurdles to adoption and to build what businesses adopting AI actually want. For example, one key innovation of AI-enabled code editors such as Cursor and Windsurf is the user interface, which allows programmers to verify AI-generated code at different levels of abstraction. Other industries will have different obstacles to AI adoption that products should address.</p><p>Policymakers should also take a long-term view. A &#8220;Manhattan Project for AGI&#8221; is misguided on many levels. Since AGI is not a milestone, there is no way to know when the goal has been reached or how much more needs to be invested. And accelerating AI capabilities does nothing to address the real bottlenecks to realizing its economic benefits.</p><p>This view also has implications for export controls. The US has applied export controls on the hardware needed for AI development, hoping to slow down Chinese AI development. Proponents of export controls <a href="https://blog.ai-futures.org/p/why-america-wins">admit</a> (and we agree) that this will not widen the gap between the US and China by more than a few months. But this only matters in a world where the impacts of developing advanced AI are rapid. If the impact of advanced AI is realized through diffusion, and the process of diffusion takes decades, being ahead by a matter of months in a game of decades is of little consequence. Policymakers should therefore focus on enabling diffusion. We outline some ideas on how to do so in our <a href="https://www.aisnakeoil.com/p/ai-as-normal-technology">recent essay</a>.</p><div><hr></div><p>Treating AGI as a milestone for the development of transformative AI is seductive but misguided. It feeds into incorrect mental models of AI progress and risks, economic impacts, and geopolitics. AI&#8217;s impact on the world will be realized not through a sprint toward a magic-bullet technology but through millions of boring little business process adaptations and policy tweaks.</p><p><em>We are grateful to Steve Newman and Jasmine Sun for feedback on a draft of this essay.</em></p><h3><strong>Further reading</strong></h3><ul><li><p><a href="https://jasmi.news/p/agi">Jasmine Sun</a>&#8217;s recent essay is similar to ours in spirit and substance. She contrasts many popular definitions of AGI and concludes that AGI is better seen as an aspiration for the AI community than as a concrete technical milestone.
She has also <a href="https://define-agi.com/">compiled</a> popular definitions of AGI and other related concepts.</p></li><li><p><a href="https://arxiv.org/abs/2502.03689">Borhane Blili-Hamelin and others</a> argue that we should stop treating AGI as the north star for AI research.</p></li><li><p><a href="https://epochai.substack.com/p/the-case-for-multi-decade-ai-timelines">Ege Erdil</a> makes the case for multi-decade timelines to transformative AI capabilities.</p></li><li><p><a href="https://www.techpolicy.press/most-researchers-do-not-believe-agi-is-imminent-why-do-policymakers-act-otherwise/">Eryk Salvaggio</a> gives many reasons why it is dangerous for policy makers to act as if AGI is imminent.</p></li><li><p>In <a href="https://www.aisnakeoil.com/p/ai-as-normal-technology">AI as Normal Technology</a>, we laid the intellectual foundation for our view of the future of AI. The present essay is the first of many follow-ups that explore the implications of the ideas therein.</p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>While LLMs can only output text, models like o3 are trained to use their text outputs to interact with tools. For example, o3 can output a specific keyword (say, "use-search") that allows it to search for a given query online. To be clear, previous LLMs could also interact with tools. For example, OpenAI's <a href="https://openai.com/index/chatgpt-plugins/">ChatGPT plugins</a> allowed models like GPT-4 to use tools two years ago. The key difference with models like o3 is that they are trained to use tools effectively. Previous models used tools based on prompts without explicitly being trained for this task, or at best were taught the syntax of tool use. o3 has access to many tools on the ChatGPT interface, including search and the ability to run code, access files, and generate and reason about images. Overall, o3 is much more like an "agent" than a chatbot.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For example, suppose this system is used to carry out a new type of coding task, perhaps using a software library that was only released after it finished training. Given enough documentation and instructions, it might eventually correctly perform the task. But barring an explicit update to the model via retraining, this "experience" of correctly performing the task would not change the model's weights, so another user needing to perform the same task would need to give it the same help that the first user did.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>We do not analyze military AI in this piece.
It is an important topic that we will return to in future essays.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>It is important to <a href="https://www.aisnakeoil.com/p/ai-as-normal-technology">distinguish</a> between the invention of new technology and innovations that help those technologies diffuse across society; our discussion in this paragraph is about the former. Invention refers to the development of new models, perhaps accompanied by new capabilities. Innovations enable these capabilities to be used productively. Better and more capable models are an example of invention, whereas the development of <a href="https://www.aisnakeoil.com/p/ai-companies-are-pivoting-from-creating">products</a> that use these models is an example of innovation. We don't think it is likely that countries will be able to maintain a long and sustained advantage in invention. But as we discuss later in this section, the conditions that enable the productive use of inventions can differ greatly between countries.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Amodei eschews AGI and uses the term &#8220;powerful AI&#8221; to be more precise about his predictions.</p></div></div>]]></content:encoded></item><item><title><![CDATA[AI as Normal Technology]]></title><description><![CDATA[A new paper that we will expand into our next book]]></description><link>https://www.normaltech.ai/p/ai-as-normal-technology</link><guid isPermaLink="false">https://www.normaltech.ai/p/ai-as-normal-technology</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Tue, 15 Apr 2025 14:53:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ac4b6c2a-d282-4124-a4fa-360ae04c9abe_2602x1550.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is over 15,000 words long&#8212;it is a new paper on our vision for the future of AI. We are pleased to announce that an expanded version of these ideas will become our next book together.</em></p><p><em>The paper is also published in <a href="https://knightcolumbia.org/content/ai-as-normal-technology">HTML</a> and <a href="https://kfai-documents.s3.amazonaws.com/documents/c3cac5a2a7/AI-as-Normal-Technology---Narayanan---Kapoor.pdf">PDF</a> formats on the Knight First Amendment Institute&#8217;s website. We are grateful for the extensive feedback we&#8217;ve received on drafts of the paper.</em></p><p><em><strong>Update (September 2025)</strong>: We have published a companion to this essay titled <a href="https://www.normaltech.ai/p/a-guide-to-understanding-ai-as-normal">A guide to understanding AI as normal technology</a>.</em></p><div><hr></div><p>We articulate a vision of artificial intelligence (AI) as <em>normal technology</em>. To view AI as normal is not to understate its impact&#8212;even transformative, general-purpose technologies such as electricity and the internet are &#8220;normal&#8221; in our conception. 
But it is in contrast to both utopian and dystopian visions of the future of AI, which have a common tendency to treat it akin to a separate species, a highly autonomous, potentially superintelligent entity.<sup>1</sup></p><p>The statement &#8220;AI is normal technology&#8221; is three things: a <em>description</em> of current AI, a <em>prediction</em> about the foreseeable future of AI, and a <em>prescription</em> about how we should treat it. We view AI as a tool that we can and should remain in control of, and we argue that this goal does not require drastic policy interventions or technical breakthroughs. We do not think that viewing AI as a humanlike intelligence is currently accurate or useful for understanding its societal impacts, nor is it likely to be in our vision of the future.<sup>2</sup></p><p>The normal technology frame is about the relationship between technology and society. It rejects technological determinism, especially the notion of AI itself as an agent in determining its future. It is guided by lessons from past technological revolutions, such as the slow and uncertain nature of technology adoption and diffusion. It also emphasizes continuity between the past and the future trajectory of AI in terms of societal impact and the role of institutions in shaping this trajectory.</p><p>In Part I, we explain why we think that transformative economic and societal impacts will be slow (on the timescale of decades), making a critical distinction between AI methods, AI applications, and AI adoption, arguing that the three happen at different timescales.</p><p>In Part II, we discuss a potential division of labor between humans and AI in a world with advanced AI (but not &#8220;superintelligent&#8221; AI, which we view as incoherent as usually conceptualized). In this world, control is primarily in the hands of people and organizations; indeed, a greater and greater proportion of what people do in their jobs is AI control.</p><p>In Part III, we examine the implications of AI as normal technology for AI risks. We analyze accidents, arms races, misuse, and misalignment, and argue that viewing AI as normal technology leads to fundamentally different conclusions about mitigations compared to viewing AI as being humanlike.</p><p>Of course, we cannot be certain of our predictions, but we aim to describe what we view as the median outcome. We have not tried to quantify probabilities, but we have tried to make predictions that can tell us whether or not AI is behaving like normal technology.</p><p>In Part IV, we discuss the implications for AI policy. We advocate for reducing uncertainty as a first-rate policy goal and resilience as the overarching approach to catastrophic risks. We argue that drastic interventions premised on the difficulty of controlling superintelligent AI will, in fact, make things much worse if AI turns out to be normal technology&#8212;the downsides of which are likely to mirror those of previous technologies that are deployed in capitalistic societies, such as inequality.<sup>3</sup></p><p>The world we describe in Part II is one in which AI is far more advanced than it is today. We are not claiming that AI progress&#8212;or human progress&#8212;will stop at that point. What comes after it? We do not know.
Consider this analogy: At the dawn of the first Industrial Revolution, it would have been useful to try to think about what an industrial world would look like and how to prepare for it, but it would have been futile to try to predict electricity or computers. Our exercise here is similar. Since we reject &#8220;fast takeoff&#8221; scenarios, we do not see it as necessary or useful to envision a world further ahead than we have attempted to. If and when the scenario we describe in Part II materializes, we will be able to better anticipate and prepare for whatever comes next.</p><p><em><strong>A note to readers.</strong> This essay has the unusual goal of stating a worldview rather than defending a proposition. The literature on AI superintelligence is copious. We have not tried to give a point-by-point response to potential counterarguments, as that would make the paper several times longer. This paper is merely the initial articulation of our views; we plan to elaborate on them in various follow-ups.</em></p><h1>Part I: The Speed of Progress</h1><div class="captioned-image-container"><figure><figcaption class="image-caption"><em>Figure 1.
Like other general-purpose technologies, the impact of AI is materialized not when methods and capabilities improve, but when those improvements are translated into applications and are diffused through productive sectors of the economy.<sup>4</sup> There are speed limits at each stage.</em></figcaption></figure></div><p>Will the progress of AI be gradual, allowing people and institutions to adapt as AI capabilities and adoption increase, or will there be jumps leading to massive disruption, or even a technological singularity? Our approach to this question is to analyze highly consequential tasks separately from less consequential tasks and to begin by analyzing the speed of adoption and diffusion of AI before returning to the speed of innovation and invention.</p><p>We use <em>invention</em> to refer to the development of new AI methods&#8212;such as large language models&#8212;that improve AI&#8217;s capabilities to carry out various tasks. <em>Innovation</em> refers to the development of products and applications using AI that consumers and businesses can use. <em>Adoption</em> refers to the decision by an individual (or team or firm) to use a technology, whereas <em>diffusion</em> refers to the broader social process through which the level of adoption increases. For sufficiently disruptive technologies, diffusion might require changes to the structure of firms and organizations, as well as to social norms and laws.</p><h2>AI Diffusion in Safety-Critical Areas Is Slow</h2><p>In the paper <em>Against Predictive Optimization</em>, we compiled a comprehensive list of about 50 applications of predictive optimization, namely the use of machine learning (ML) to make decisions about individuals by predicting their future behavior or outcomes.<sup>5</sup> Most of these applications, such as criminal risk prediction, insurance risk prediction, or child maltreatment prediction, are used to make decisions that have important consequences for people.</p><p>While these applications have proliferated, there is a crucial nuance: In most cases, decades-old statistical techniques are used&#8212;simple, interpretable models (mostly regression) and relatively small sets of handcrafted features. More complex machine learning methods, such as random forests, are rarely used, and modern methods, such as transformers, are nowhere to be found.</p><p>In other words, in this broad set of domains, AI diffusion lags <em>decades</em> behind innovation. A major reason is safety&#8212;when models are more complex and less intelligible, it is hard to anticipate all possible deployment conditions in the testing and validation process. A good example is Epic&#8217;s sepsis prediction tool, which, despite having seemingly high accuracy when internally validated, performed far worse in hospitals, missing two-thirds of sepsis cases and overwhelming physicians with false alerts.<sup>6</sup></p><p>Epic&#8217;s sepsis prediction tool failed because of errors that are hard to catch when you have complex models with unconstrained feature sets.<sup>7</sup> In particular, one of the features used to train the model was whether a physician had already prescribed antibiotics&#8212;to treat sepsis. In other words, during testing and validation, the model was using a feature from the future, relying on a variable that was causally dependent on the outcome. Of course, this feature would not be available during deployment.</p>
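<p>To make this failure mode concrete, here is a minimal sketch in Python on synthetic data (not Epic&#8217;s actual model or features): a classifier is trained with a feature that is causally downstream of the very label it predicts. Validation looks excellent, but the feature does not exist at prediction time, and the model misses most true cases.</p><pre><code>
# Sketch of the leakage pattern described above, on synthetic data.
# The "antibiotics" feature is caused by the outcome (sepsis), so it is
# present in historical records but unknowable at prediction time.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
sepsis = rng.binomial(1, 0.10, n)                # outcome, 10% base rate
vitals = rng.normal(size=n) + 0.5 * sepsis       # weak but legitimate signal
antibiotics = sepsis * rng.binomial(1, 0.90, n)  # prescribed *because of* sepsis

X = np.column_stack([vitals, antibiotics])
X_train, X_test, y_train, y_test = train_test_split(X, sepsis, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Validation looks excellent, because the leaked feature is present here too.
print("validation accuracy:", model.score(X_test, y_test))

# At prediction time, antibiotics cannot have been prescribed yet, so the
# feature is zero for everyone; the model now misses most true sepsis cases.
X_deploy = X_test.copy()
X_deploy[:, 1] = 0
sick = y_test == 1
print("fraction of sepsis cases caught:", model.predict(X_deploy[sick]).mean())
</code></pre>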
<p>Interpretability and auditing methods will no doubt improve so that we will get much better at catching these issues, but we are not there yet.</p><p>In the case of generative AI, even failures that seem extremely obvious in hindsight were not caught during testing. One example is the early Bing chatbot &#8220;Sydney&#8221; that went off the rails during extended conversations; the developers evidently did not anticipate that conversations could last for more than a handful of turns.<sup>8</sup> Similarly, the Gemini image generator was seemingly never tested on historical figures.<sup>9</sup> Fortunately, these were not highly consequential applications.</p><p>More empirical work would be helpful for understanding the innovation-diffusion lag in various applications and the reasons for this lag. But, for now, the evidence that we have analyzed in our previous work is consistent with the view that there are already extremely strong safety-related speed limits in highly consequential tasks. These limits are often enforced through regulation, such as the FDA&#8217;s supervision of medical devices, as well as newer legislation such as the EU AI Act, which puts strict requirements on high-risk AI.<sup>10</sup> In fact, there are (credible) concerns that existing regulation of high-risk AI is so onerous that it may lead to &#8220;runaway bureaucracy&#8221;.<sup>11</sup> Thus, we predict that slow diffusion will continue to be the norm in high-consequence tasks.</p><p>At any rate, as and when new areas arise in which AI can be used in highly consequential ways, we can and must regulate them. A good example is the Flash Crash of 2010, in which automated high-frequency trading is thought to have played a part. This led to new curbs on trading, such as circuit breakers.<sup>12</sup></p><h2>Diffusion Is Limited by the Speed of Human, Organizational, and Institutional Change</h2><p>Even outside of safety-critical areas, AI adoption is slower than popular accounts would suggest. For example, a study made headlines due to the finding that, in August 2024, 40% of U.S. adults used generative AI.<sup>13</sup> But, because most people used it infrequently, this only translated to 0.5%-3.5% of work hours (and a 0.125-0.875 percentage point increase in labor productivity). The arithmetic implies an assumption that generative AI makes its hours of use roughly 25% more productive: 25% of 0.5%-3.5% of work hours yields 0.125-0.875 percentage points.</p><p>It is not even clear if the speed of diffusion is greater today compared to the past. The aforementioned study reported that generative AI adoption in the U.S. has been faster than personal computer (PC) adoption, with 40% of U.S. adults adopting generative AI within two years of the first mass-market product release compared to 20% within three years for PCs. But this comparison does not account for differences in the intensity of adoption (the number of hours of use) or the high cost of buying a PC compared to accessing generative AI.<sup>14</sup> Depending on how we measure adoption, it is quite possible that the adoption of generative AI has been much slower than PC adoption.</p><p>The claim that the speed of technology adoption is not necessarily increasing may seem surprising (or even obviously wrong) given that digital technology can reach billions of devices at once. But it is important to remember that adoption is about software use, not availability.
Even if a new AI-based product is instantly released online for anyone to use for free, it takes time for people to change their workflows and habits to take advantage of the benefits of the new product and to learn to avoid the risks.</p><p>Thus, the speed of diffusion is inherently limited by the speed at which not only individuals, but also organizations and institutions, can adapt to technology. This is a trend that we have also seen for past general-purpose technologies: Diffusion occurs over decades, not years.<sup>15</sup></p><p>As an example, Paul A. David&#8217;s analysis of electrification shows that the productivity benefits took decades to fully materialize.<sup>16</sup> Electric dynamos were &#8220;everywhere but in the productivity statistics&#8221; for nearly 40 years after Edison&#8217;s first central generating station.<sup>17</sup> This was not just technological inertia; factory owners found that electrification did not bring substantial efficiency gains.</p><p>What eventually allowed gains to be realized was redesigning the entire layout of factories around the logic of production lines. In addition to changes to factory architecture, diffusion also required changes to workplace organization and process control, which could only be developed through experimentation across industries. Workers had more autonomy and flexibility as a result of the changes, which also necessitated different hiring and training practices.</p><h2>The External World Puts a Speed Limit on AI Innovation</h2><p>It is true that technical advances in AI have been rapid, but the picture is much less clear when we differentiate AI methods from applications.</p><p>We conceptualize progress in AI methods as a ladder of generality.<sup>18</sup> Each step on this ladder rests on the ones below it and reflects a move toward more general computing capabilities. That is, it reduces the programmer effort needed to get the computer to perform a new task and increases the set of tasks that can be performed with a given amount of programmer (or user) effort; see Figure 2. For example, machine learning increases generality by obviating the need for the programmer to devise logic to solve each new task, only requiring the collection of training examples instead.</p><p>It is tempting to conclude that the effort required to develop specific applications will keep decreasing as we build more rungs of the ladder until we reach artificial general intelligence, often conceptualized as an AI system that can do everything out of the box, obviating the need to develop applications altogether.</p><p>In some domains, we are indeed seeing this trend of decreasing application development effort. In natural language processing, large language models have made it relatively trivial to develop a language translation application, as the sketch below illustrates.</p>
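<p>Here is such a sketch, assuming the OpenAI Python client and an API key in the environment; the model name is illustrative. The entire &#8220;application&#8221; is a dozen lines of glue code around a general-purpose model, where translation systems once required dedicated research teams.</p><pre><code>
# A translation "application" built on a general-purpose language model.
# Assumes the OpenAI Python client with OPENAI_API_KEY set in the
# environment; the model name is illustrative.
from openai import OpenAI

client = OpenAI()

def translate(text: str, target_language: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Translate the user's message into {target_language}."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

print(translate("Il pleut des cordes.", "English"))
</code></pre>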
<p>Or consider games: AlphaZero can learn to play games such as chess better than any human through self-play, given little more than a description of the game and enough computing power&#8212;a far cry from how game-playing programs used to be developed.</p><div class="captioned-image-container"><figure>
<figcaption class="image-caption"><em>Figure 2: The Ladder of Generality in Computing. For some tasks, higher ladder rungs require less programmer effort to get a computer to perform a new task, and more tasks can be performed with a given amount of programmer (or user) effort.</em><sup>19</sup></figcaption></figure></div><p>However, this has not been the trend in highly consequential, real-world applications that cannot easily be simulated and in which errors are costly. Consider self-driving cars: In many ways, the trajectory of their development is similar to AlphaZero&#8217;s self-play&#8212;improving the tech allowed them to drive in more realistic conditions, which enabled the collection of better and/or more realistic data, which in turn led to improvements in the tech, completing the feedback loop. But this process took over two decades instead of a few hours in the case of AlphaZero because safety considerations put a limit on the extent to which each iteration of this loop could be scaled up compared to the previous one.<sup>20</sup></p><p>This &#8220;capability-reliability gap&#8221; shows up over and over. It has been a major barrier to building useful AI &#8220;agents&#8221; that can automate real-world tasks.<sup>21</sup> One reason is that errors compound over long tasks: for illustration, an agent that completes each step of a task with 98% reliability finishes a 50-step task without a mistake only about 36% of the time (0.98<sup>50</sup> &#8776; 0.36). To be clear, many tasks for which the use of agents is envisioned, such as booking travel or providing customer service, are far less consequential than driving, but still costly enough that having agents learn from real-world experiences is not straightforward.</p><p>Barriers also exist in non-safety-critical applications. In general, much knowledge is tacit in organizations and is not written down, much less in a form that can be learned passively.
This means that these developmental feedback loops will have to happen in each sector and, for more complex tasks, may even need to occur separately in different organizations, limiting opportunities for rapid, parallel learning. Privacy concerns further limit parallel learning: Organizations and individuals might be averse to sharing sensitive data with AI companies, and regulations might limit what kinds of data can be shared with third parties in contexts such as healthcare.</p><p>The &#8220;bitter lesson&#8221; in AI is that general methods that leverage increases in computational power eventually surpass methods that utilize human domain knowledge by a large margin.<sup>22</sup> This is a valuable observation about <em>methods</em>, but it is often misinterpreted to encompass application development. In the context of AI-based product development, the bitter lesson has never been even close to true.<sup>23</sup> Consider recommender systems on social media: They are powered by (increasingly general) machine learning models, but this has not obviated the need for manual coding of the business logic, the frontend, and other components which, together, can comprise on the order of a million lines of code.</p><p>Further limits arise when we need to go beyond AI learning from existing human knowledge.<sup>24</sup> Some of our most valuable types of knowledge are scientific and social-scientific, and have allowed the progress of civilization through technology and large-scale social organizations (e.g., governments). What will it take for AI to push the boundaries of such knowledge? It will likely require interactions with, or even experiments on, people or organizations, ranging from drug testing to economic policy. Here, there are hard limits to the speed of knowledge acquisition because of the social costs of experimentation. Societies probably will not (and should not) allow the rapid scaling of experiments for AI development.</p><h2>Benchmarks Do Not Measure Real-World Utility</h2><p>The methods-application distinction has important implications for how we measure and forecast AI progress. AI benchmarks are useful for measuring progress in methods; unfortunately, they have often been misunderstood as measuring progress in applications, and this confusion has been a driver of much hype about imminent economic transformation.</p><p>For example, while GPT-4 reportedly achieved scores in the top 10% of bar exam test takers, this tells us remarkably little about AI&#8217;s ability to practice law.<sup>25</sup> The bar exam overemphasizes subject-matter knowledge and underemphasizes real-world skills that are far harder to measure in a standardized, computer-administered format. In other words, it emphasizes precisely what language models are good at&#8212;retrieving and applying memorized information.</p><p>More broadly, tasks that would lead to the most significant changes to the legal profession are also the hardest ones to evaluate. Evaluation is straightforward for tasks like categorizing legal requests by area of law because there are clear correct answers. But for tasks that involve creativity and judgment, like preparing legal filings, there is no single correct answer, and reasonable people can disagree about strategy. These latter tasks are precisely the ones that, if automated, would have the most profound impact on the profession.<sup>26</sup></p><p>This observation is in no way limited to law.
Another example is the gap between self-contained coding problems, at which AI demonstrably excels, and real-world software engineering, in which its impact is hard to measure but appears to be modest.<sup>27</sup> Even highly regarded coding benchmarks that go beyond toy problems must necessarily ignore many dimensions of real-world software engineering in the interest of quantification and automated evaluation using publicly available data.<sup>28</sup></p><p>This pattern appears repeatedly: The easier a task is to measure via benchmarks, the less likely it is to represent the kind of complex, contextual work that defines professional practice. By focusing heavily on capability benchmarks to inform our understanding of AI progress, the AI community consistently overestimates the real-world impact of the technology.</p><p>This is a problem of &#8216;construct validity,&#8217; which refers to whether a test actually measures what it is intended to measure.<sup>29</sup> The only sure way to measure the real-world usefulness of a potential application is to actually build the application and to then test it with professionals in realistic scenarios (either substituting or augmenting their labor, depending on the intended use). Such &#8216;uplift&#8217; studies generally do show that professionals in many occupations benefit from existing AI systems, but this benefit is typically modest and is more about augmentation than substitution. This is a radically different picture from what one might conclude based on static benchmarks like exams.<sup>30</sup> (A small number of occupations, such as copywriters and translators, have seen substantial job losses.<sup>31</sup>)</p><p>In conclusion, while benchmarks are valuable for tracking progress in AI methods, we should look at other kinds of metrics to track AI impacts (Figure 1). When measuring adoption, we must take into account the intensity of AI use. The type of application is also important: Augmentation versus substitution and high-consequence versus low-consequence.</p><p>The difficulty of ensuring construct validity afflicts not only benchmarking, but also forecasting, which is another major way in which people try to assess (future) AI impacts. Effective forecasting requires avoiding ambiguously defined outcomes. The way that the forecasting community accomplishes this is by defining milestones in terms of relatively narrow skills, such as exam performance. For instance, the Metaculus question on &#8220;human-machine intelligence parity&#8221; is defined in terms of performance on exam questions in math, physics, and computer science. Based on this definition, it is not surprising that forecasters predict a 95% chance of achieving &#8220;human-machine intelligence parity&#8221; by 2040.<sup>32</sup></p><p>Unfortunately, this definition is so watered down that it does not mean much for understanding the impacts of AI. As we saw above with legal and other professional benchmarks, AI performance on exams has so little construct validity that it does not even allow us to predict whether AI will replace professional workers.</p><h2>Economic Impacts Are Likely to Be Gradual</h2><p>One argument for why AI development may have sudden, drastic economic impacts is that an increase in generality may lead to a wide swath of tasks in the economy becoming automatable.
This is related to one definition of artificial general intelligence (AGI)&#8212;a unified system that is capable of performing all economically valuable tasks.</p><p>According to the normal technology view, such sudden economic impacts are implausible. In the previous sections, we discussed one reason: Sudden improvements in AI methods are certainly possible but do not directly translate to economic impacts, which require innovation (in the sense of application development) and diffusion.</p><p>Innovation and diffusion happen in a feedback loop. In safety-critical applications, this feedback loop is always slow, but even beyond safety, there are many reasons why it is likely to be slow. With past general-purpose technologies such as electricity, computers, and the internet, the respective feedback loops unfolded over several decades, and we should expect the same to happen with AI as well.</p><p>Another argument for gradual economic impacts: Once we automate something, its cost of production, and its value, tend to drop drastically over time compared to the cost of human labor. As automation increases, humans will adapt, and will focus on tasks that are not yet automated, perhaps tasks that do not exist today (in Part II we describe what those might look like).</p><p>This means that the goalpost of AGI will continually move further away as increasing automation redefines which tasks are economically valuable. Even if every task that humans do <em>today</em> might be automated one day, this does not mean that human labor will be superfluous.</p><p>All of this points away from the likelihood of the automation of a vast swath of the economy at a particular moment in time. It also implies that the impacts of powerful AI will be felt on different timescales in different sectors.</p><h2>Speed Limits to Progress in AI Methods</h2><p>Our argument for the slowness of AI impact is based on the innovation-diffusion feedback loop, and is applicable even if progress in AI methods can be arbitrarily sped up. We see both benefits and risks as arising primarily from AI deployment rather than from development; thus, the speed of progress in AI methods is not directly relevant to the question of impacts. Nonetheless, it is worth discussing speed limits that also apply to methods development.</p><p>The production of AI research has been increasing exponentially, with the rate of publication of AI/ML papers on arXiv exhibiting a doubling time under two years.<sup>33</sup> But it is not clear how this increase in volume translates to progress. One measure of progress is the rate of turnover of central ideas. Unfortunately, throughout its history, the AI field has shown a high degree of herding around popular ideas, and inadequate (in retrospect) levels of exploration of unfashionable ones. A notable example is the sidelining of research on neural networks for many decades.</p><p>Is the current era different? Although ideas incrementally accrue at increasing rates, are they turning over established ones? The transformer architecture has been the dominant paradigm for most of the last decade, despite its well-known limitations. By analyzing over a billion citations in 241 subjects, Johan S.G. Chu &amp; James A. Evans showed that, in fields in which the volume of papers is higher, it is harder, not easier, for new ideas to break through. 
This leads to an &#8220;ossification of canon.&#8221;<sup>34</sup> Perhaps this description applies to the current state of AI methods research.</p><p>Many other speed limits are possible. Historically, deep neural network technology was partly held back due to the inadequacy of hardware, particularly graphics processing units (GPUs). Computational and cost limits continue to be relevant to new paradigms, including inference-time scaling. New slowdowns may emerge: Recent signs point to a shift away from the culture of open knowledge sharing in the industry.</p><p>It remains to be seen if AI-conducted AI research can offer a reprieve. Perhaps recursive self-improvement in methods is possible, resulting in unbounded speedups. But note that AI development already relies heavily on AI. It is more likely that we will continue to see a gradual increase in the role of automation in AI development than a singular, discontinuous moment when recursive self-improvement is achieved.<sup>35</sup></p><p>Earlier, we argued that benchmarks give a misleading picture of the usefulness of AI applications. But they have arguably also led to overoptimism about the speed of methods progress. One reason is that it is hard to design benchmarks that make sense beyond the current horizon of progress. The Turing test was the north star of AI for many decades because of the assumption that any system that passed it would be humanlike in important ways, and that we would be able to use such a system to automate a variety of complex tasks. Now that large language models can arguably pass it while only weakly meeting the expectations behind the test, its significance has waned.<sup>36</sup></p><p>An analogy with mountaineering is apt. Every time we solve a benchmark (reach what we thought was the peak), we discover limitations of the benchmark (realize that we&#8217;re on a &#8216;false summit&#8217;) and construct a new benchmark (set our sights on what we now think is the summit). This leads to accusations of &#8216;moving the goalposts&#8217;, but this is what we should expect given the intrinsic challenges of benchmarking.</p><p>AI pioneers considered the two big challenges of AI (what we now call AGI) to be (what we now call) hardware and software. Having built programmable machines, there was a palpable sense that AGI was close. The organizers of the 1956 Dartmouth conference hoped to make significant progress toward the goal through a &#8220;2-month, 10-man&#8221; effort.<sup>37</sup> Today, we have climbed many more rungs on the ladder of generality. We often hear that all that is needed to build AGI is scaling, or generalist AI agents, or sample-efficient learning.</p><p>But it is useful to bear in mind that what appears to be a single step might not be so. For example, there may not exist one single breakthrough algorithm that enables sample-efficient learning across all contexts. Indeed, in-context learning in large language models is already &#8220;sample efficient,&#8221; but only works for a limited set of tasks.<sup>38</sup></p><h1>Part II: What a World With Advanced AI Might Look Like</h1><p>We argue that reliance on the slippery concepts of &#8216;intelligence&#8217; and &#8216;superintelligence&#8217; has clouded our ability to reason clearly about a world with advanced AI. By unpacking intelligence into distinct underlying concepts, capability and power, we rebut the notion that human labor will be superfluous in a world with &#8216;superintelligent&#8217; AI, and present an alternative vision.
This also lays the foundation for our discussion of risks in Part III.</p><h2>Human Ability Is Not Constrained by Biology</h2><p>Can AI exceed human intelligence and, if so, by how much? According to a popular argument, unfathomably so. This is often depicted by comparing different species along a spectrum of intelligence.</p><div class="captioned-image-container"><figure>
<figcaption class="image-caption">Figure 3. Intelligence explosion through recursively self-improved AI is a common concern, often depicted by figures like this one. Figure redrawn.<sup>39</sup></figcaption></figure></div><p>However, there are conceptual and logical flaws with this picture. On a conceptual level, intelligence&#8212;especially as a comparison between different species&#8212;is not well defined, let alone measurable on a one-dimensional scale.<sup>40</sup></p><p>More importantly, intelligence is not the property at stake for analyzing AI&#8217;s impacts. Rather, what is at stake is power&#8212;the ability to modify one&#8217;s environment. To clearly analyze the impact of technology (and in particular, increasingly general computing technology), we must investigate how technology has affected humanity&#8217;s power. When we look at things from this perspective, a completely different picture emerges.</p><div class="captioned-image-container"><figure>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f5b310f-3c31-45a7-8e57-aafb87b0405b_1868x428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:334,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74565,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aisnakeoil.com/i/161317202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f5b310f-3c31-45a7-8e57-aafb87b0405b_1868x428.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5xra!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f5b310f-3c31-45a7-8e57-aafb87b0405b_1868x428.png 424w, https://substackcdn.com/image/fetch/$s_!5xra!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f5b310f-3c31-45a7-8e57-aafb87b0405b_1868x428.png 848w, https://substackcdn.com/image/fetch/$s_!5xra!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f5b310f-3c31-45a7-8e57-aafb87b0405b_1868x428.png 1272w, https://substackcdn.com/image/fetch/$s_!5xra!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f5b310f-3c31-45a7-8e57-aafb87b0405b_1868x428.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 4. Analyzing the impact of technology on humanity&#8217;s power. We are powerful not because of our intelligence, but because of the technology we use to increase our capabilities.</figcaption></figure></div><p>This shift in perspective clarifies that humans have always used technology to increase our ability to control our environment. There are few biological or physiological differences between ancestral and modern humans; instead, the relevant differences are improved knowledge and understanding, tools, technology and, indeed, AI. In a sense, modern humans, with the capability to alter the planet and its climate, are &#8216;superintelligent&#8217; beings compared to pre-technological humans. 
Unfortunately, much of the foundational literature analyzing the risks of AI superintelligence suffers from a lack of precision in the use of the term &#8216;intelligence.&#8217;</p><div class="captioned-image-container"><figure>
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5. Two views of the causal chain from increases in AI capability to loss of control.</figcaption></figure></div><p>Once we stop using the terms &#8216;intelligence&#8217; and &#8216;superintelligence,&#8217; things become much clearer (Figure 5). The worry is that if AI capabilities continue to increase indefinitely (whether or not they are humanlike or superhuman is irrelevant), they may lead to AI systems with more and more power, in turn leading to a loss of control. If we accept that capabilities are likely to increase indefinitely (we do), our options for preventing a loss of control are to intervene in one of the two causal steps.</p><p>The superintelligence view is pessimistic about the first arrow in Figure 5&#8212;preventing arbitrarily capable AI systems from acquiring power that is significant enough to pose catastrophic risks&#8212;and instead focuses on alignment techniques that try to prevent arbitrarily powerful AI systems from acting against human interests. Our view is precisely the opposite, as we elaborate in the rest of this paper.</p><h2>Games Provide Misleading Intuitions About the Possibility of Superintelligence</h2><p>De-emphasizing intelligence is not just a rhetorical move: We do not think there is a useful sense of the term &#8216;intelligence&#8217; in which AI is more intelligent than people acting with the help of AI. Human intelligence is special due to our ability to use tools and to subsume other intelligences into our own, and cannot be coherently placed on a spectrum of intelligence.</p><p>Human abilities definitely have some important limitations, notably speed. This is why machines dramatically outperform humans in domains like chess and, in a human+AI team, the human can hardly do better than simply deferring to AI. 
But speed limitations are irrelevant in most areas because high-speed sequential calculations or fast reaction times are not required.</p><p>In the few real-world tasks for which superhuman speed is required, such as nuclear reactor control, we are good at building tightly scoped automated tools to do the high-speed parts, while humans retain control of the overall system.</p><p>We offer a prediction based on this view of human abilities. We think there are relatively few real-world cognitive tasks in which human limitations are so telling that AI is able to blow past human performance (as AI does in chess). In many other areas, including some that are associated with prominent hopes and fears about AI performance, we think there is a high “irreducible error”—unavoidable error due to the inherent stochasticity of the phenomenon—and human performance is essentially near that limit.<sup>41</sup></p><p>Concretely, we propose two such areas: forecasting and persuasion. We predict that AI will not be able to meaningfully outperform trained humans (particularly teams of humans, and especially if augmented with simple automated tools) at forecasting geopolitical events (say, elections). We make the same prediction for the task of persuading people to act against their own self-interest.</p><p>The self-interest aspect of persuasion is a critical one, but it is often underappreciated. As an illustrative example of a common pattern, consider the study “Evaluating Frontier Models for Dangerous Capabilities,” which evaluated language models’ abilities to persuade people.<sup>42</sup> Some of their persuasion tests were costless to the subjects being persuaded; they were simply asked whether they believed a claim at the end of the interaction with the AI. Other tests had small costs, such as forfeiting a £20 bonus to charity (of course, donating to charity is something that people often do voluntarily). So these tests do not necessarily tell us about AI’s ability to persuade people to perform dangerous tasks. To their credit, the authors acknowledged this lack of ecological validity and stressed that their study was not a “social science experiment,” but was merely intended to evaluate model capability.<sup>43</sup> But then it is not clear that such decontextualized capability evaluations have any safety implications, yet they are typically misinterpreted as if they do.</p><p>Some care is necessary to make our predictions precise—it is not clear how much slack to allow for well-known but minor human limitations, such as the lack of calibration (in the case of forecasting) or limited patience (in the case of persuasion).</p>
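<p>To make the notions of irreducible error and forecaster calibration concrete, here is a toy calculation of our own, using the Brier score (a standard accuracy measure for probabilistic forecasts). The event probability is hypothetical and chosen only for illustration.</p><pre><code class="language-python"># Toy illustration of "irreducible error" in forecasting (our term above).
# Suppose an event is genuinely stochastic: it occurs with probability 0.6.
# Even a perfectly calibrated forecaster cannot score better, in expectation,
# than the noise floor p*(1-p) under the Brier score.

p_true = 0.6  # the event's true probability (unknowable in practice)

def expected_brier(forecast: float, p: float = p_true) -> float:
    """Expected squared error of a probabilistic forecast of a Bernoulli(p) event."""
    # E[(q - Y)^2] = p*(1-p) + (q - p)^2, minimized at q = p.
    return p * (1 - p) + (forecast - p) ** 2

for q in [0.3, 0.5, 0.6, 0.7, 0.9]:
    print(f"forecast {q:.1f}: expected Brier score {expected_brier(q):.3f}")
</code></pre><p>The floor of 0.240 at q = 0.6 is the irreducible error for this event: no forecaster, human or AI, can beat it in expectation. Skill, including calibration, only shows up in the (q - p)^2 term above that floor, which is the term good human forecasters already keep small.</p>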
<h2>Control Comes in Many Flavors</h2><p>If we presume superintelligence, the control problem evokes the metaphor of building a galaxy brain and then keeping it in a box, which is a terrifying prospect. But, if we are correct that AI systems will not be meaningfully more capable than humans acting with AI assistance, then the control problem is much more tractable, especially if superhuman persuasion turns out to be an unfounded concern.</p><p>Discussions of AI control tend to over-focus on a few narrow approaches, including model alignment and keeping humans in the loop.<sup>44</sup> We can roughly think of these as opposite extremes: delegating safety decisions entirely to AI during system operation, and having a human second-guess every decision. There is a role for such approaches, but it is very limited. In Part III, we explain our skepticism of model alignment. By human-in-the-loop control, we mean a system in which every AI decision or action requires review and approval by a human. In most scenarios, this approach greatly diminishes the benefits of automation, and therefore either devolves into the human acting as a rubber stamp or is outcompeted by a less safe solution.<sup>45</sup> We emphasize that human-in-the-loop control is not synonymous with human oversight of AI; it is one particular oversight model, and an extreme one.</p><p>Fortunately, there are many other flavors of control that fall between these two extremes, such as auditing and monitoring. Auditing enables pre-deployment and/or periodic assessments of how well an AI system fulfills its stated goals, allowing us to anticipate catastrophic failures before they arise. Monitoring enables real-time oversight, flagging cases in which system properties diverge from expected behavior and allowing human intervention when truly needed.</p><p>Other ideas come from system safety, an engineering discipline that is focused on preventing accidents in complex systems through systematic analysis and design.<sup>46</sup> Examples include fail-safes, which ensure that systems default to a safe state when they malfunction, such as a predefined rule or a hard-coded action, and circuit breakers, which automatically stop operations when predefined safety thresholds are exceeded. Other techniques include redundancy in critical components and the verification of safety properties of the system’s actions.</p><p>Other computing fields, including cybersecurity, formal verification, and human-computer interaction, are also rich sources of control techniques that have been successfully applied to traditional software systems and are equally applicable to AI. In cybersecurity, the principle of ‘least privilege’ ensures that actors only have access to the minimum resources needed for their tasks. Access controls prevent people working with sensitive data and systems from accessing confidential information and tools that are not required for their jobs. We can design similar protections for AI systems in consequential settings. Formal verification methods ensure that safety-critical code works according to its specification; they are now being used to verify the correctness of AI-generated code.<sup>47</sup> From human-computer interaction, we can borrow ideas like designing systems so that state-changing actions are reversible, allowing humans to retain meaningful control even in highly automated systems.</p><p>In addition to existing ideas from other fields being adapted for AI control, technical AI safety research has generated many new ideas.<sup>48</sup> Examples include using language models as automated judges to evaluate the safety of proposed actions, developing systems that learn when to appropriately escalate decisions to human operators based on uncertainty or risk level, designing agentic systems so that their activity is visible and legible to humans, and creating hierarchical control structures in which simpler and more reliable AI systems oversee more capable but potentially unreliable ones.<sup>49</sup></p>
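<p>To make these flavors of control concrete, here is a minimal sketch of how a deployer might compose a few of them (least privilege, a circuit breaker, monitoring, and uncertainty-based escalation) around an AI agent’s proposed actions. The tool names, thresholds, and risk scores are our own hypothetical choices, not a reference implementation from the literature.</p><pre><code class="language-python"># Minimal sketch: composing several control mechanisms around an AI agent.
# All names, thresholds, and policies here are hypothetical illustrations.

ALLOWED_TOOLS = {"read_ticket", "draft_reply"}   # least privilege: minimal tool set
RISK_ESCALATE = 0.5    # above this, defer to a human operator
RISK_REFUSE   = 0.9    # above this, trip the circuit breaker

class CircuitBreakerTripped(Exception):
    pass

def risk_score(action) -> float:
    """Placeholder for a learned or rule-based risk estimate
    (e.g., a language-model judge scoring the proposed action)."""
    return action.get("estimated_risk", 0.0)

def execute(action):
    print(f"executing {action['tool']}")   # stand-in for the real side effect

def ask_human(action) -> bool:
    """Escalation point: a human reviews only the risky minority of actions."""
    print(f"escalated to operator: {action['tool']}")
    return False  # conservative default for the sketch

def controlled_step(action, log):
    log.append(action)                      # monitoring: every action is auditable
    if action["tool"] not in ALLOWED_TOOLS: # least privilege
        raise PermissionError(action["tool"])
    score = risk_score(action)
    if score >= RISK_REFUSE:                # circuit breaker / fail-safe
        raise CircuitBreakerTripped(action["tool"])
    if score >= RISK_ESCALATE:              # uncertainty-based escalation
        if not ask_human(action):
            return "rejected"
    execute(action)
    return "done"

audit_log = []
print(controlled_step({"tool": "draft_reply", "estimated_risk": 0.1}, audit_log))
print(controlled_step({"tool": "draft_reply", "estimated_risk": 0.7}, audit_log))
</code></pre><p>The point of the sketch is architectural: the human reviews only the escalated minority of actions, the breaker provides a fail-safe default, and the log supports later auditing. None of this requires delegating safety decisions entirely to the model itself.</p>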
<p>Technical AI safety research is sometimes judged against the fuzzy and unrealistic goal of guaranteeing that future “superintelligent” AI will be “aligned with human values.” From this perspective, it tends to be viewed as an unsolved problem. But from the perspective of making it easier for developers, deployers, and operators of AI systems to decrease the likelihood of accidents, technical AI safety research has produced a great abundance of ideas. We predict that as advanced AI is developed and adopted, there will be increasing innovation to find new models for human control.</p><p>As more physical and cognitive tasks become amenable to automation, we predict that an increasing percentage of human jobs and tasks will be related to AI control. If this seems radical, note that this kind of near-total redefinition of the concept of work has happened previously. Before the Industrial Revolution, most jobs involved manual labor. Over time, more and more manual tasks have been automated, a trend that continues. In this process, a great many different ways of operating, controlling, and monitoring physical machines were invented, and what humans do in factories today is a combination of “control” (monitoring automated assembly lines, programming robotic systems, managing quality control checkpoints, and coordinating responses to equipment malfunctions) and some tasks that require levels of cognitive ability or dexterity of which machines are not yet capable.</p><p>Karen Levy describes how this transformation is already unfolding in the case of AI and truck drivers:</p><blockquote><p>Truck drivers’ daily work consists of much more than driving trucks. Truckers monitor their freight, keeping food at the right temperature in refrigerated trucks and loads firmly secured to flatbeds. They conduct required safety inspections twice a day. They are responsible for safeguarding valuable goods. They maintain the truck and make repairs to it—some of which are routine, and some less so. When truckers arrive at a terminal or delivery point, they don’t just drop things off and leave: some load and unload their freight; they talk to customers; they deal with paperwork; they may spend hours making “yard moves” (waiting for an available delivery bay and moving to it, much as planes do at busy airports). Could some of these tasks be eliminated by intelligent systems? Surely some can and will—but these components of the job are much harder to automate, and will come much later, than highway driving.<sup>50</sup></p></blockquote><p>In addition to AI control, task specification is likely to become a bigger part of what human jobs entail (depending on how broadly we conceive of control, specification could be considered part of control). As anyone who has tried to outsource software or product development knows, unambiguously specifying what is desired turns out to be a surprisingly big part of the overall effort. Thus, human labor—specification and oversight—will operate at the boundary between AI systems performing different tasks. Eliminating some of these efficiency bottlenecks and having AI systems autonomously accomplish larger tasks “end-to-end” will be an ever-present temptation, but doing so will increase safety risks since it will decrease legibility and control. These risks will act as a natural check against ceding too much control.</p><p>We further predict that this transformation will be primarily driven by market forces. Poorly controlled AI will be too error-prone to make business sense.
But regulation can and should bolster the ability and necessity of organizations to keep humans in control.</p><h1>Part III: Risks</h1><p>We consider five types of risks: accidents, arms races (leading to accidents), misuse, misalignment, and non-catastrophic but systemic risks.</p><p>We have already addressed accidents above. Our view is that, just like other technologies, deployers and developers should have the primary responsibility for mitigating accidents in AI systems. How effectively they will do so depends on their incentives, as well as on progress in mitigation methods. In many cases, market forces will provide an adequate incentive, but safety regulation should fill any gaps. As for mitigation methods, we reviewed how research on AI control is advancing rapidly.</p><p>There are a few reasons why this optimistic assessment might not hold. First, there might be arms races because the competitive benefits of AI are so great that they are an exception to the usual patterns. We discuss this below.</p><p>Second, a company or entity deploying AI might be so big and powerful that it is little consolation to know that it will eventually go out of business if it has a poor attitude to accident mitigation&#8212;it might take down civilization with it. For example, misbehavior by an AI agent that controls almost every consumer device might lead to catastrophically widespread data loss. While this is certainly possible, such concentration of power is a bigger problem than the possibility of AI accidents, and is precisely why our approach to policy emphasizes resilience and decentralization (Part IV).</p><p>Finally, perhaps even an AI control failure by a relatively inconspicuous deployer might lead to catastrophic risk&#8212;say because an AI agent &#8216;escapes,&#8217; makes copies of itself, and so forth. We see this as a misalignment risk, and discuss it below.</p><p>In the rest of Part III, we consider four risks&#8212;arms races, misuse, misalignment, and non-catastrophic but systemic risks&#8212;through the lens of AI as normal technology.</p><h2>Arms Races are an Old Problem</h2><p>An AI arms race is a scenario in which two or more competitors&#8212;companies, policymakers in different countries, militaries&#8212;deploy increasingly powerful AI with inadequate oversight and control. The danger is that safer actors will be outcompeted by riskier ones. For the reasons described above, we are less concerned about arms races in the <em>development</em> of AI <em>methods</em> and are more concerned about the <em>deployment</em> of AI <em>applications</em>.</p><p>One important caveat: We explicitly exclude military AI from our analysis, as it involves classified capabilities and unique dynamics that require a deeper analysis, which is beyond the scope of this essay.</p><p>Let us consider companies first. A race to the bottom in terms of safety is historically extremely common across industries and has been studied extensively; it is also highly amenable to well-understood regulatory interventions. Examples include fire safety in the U.S. garment industry (early 20th century), both food safety and worker safety in the U.S. meatpacking industry (late 19th and early 20th centuries), the U.S. steamboat industry (19th century), the mining industry (19th and early 20th centuries), and the aviation industry (early 20th century).</p><p>These races happened because companies were able to externalize the costs of poor safety, resulting in market failure. 
It is hard for consumers to assess product safety (and for workers to assess workplace safety), so market failures are common in the absence of regulation. But once regulation forces companies to internalize the costs of their safety practices, the race goes away. There are many potential regulatory strategies, including those focused on processes (standards, auditing, and inspections), outcomes (liability), and correcting information asymmetry (labeling and certification).</p><p>AI is no exception. Self-driving cars offer a good case study of the relationship between safety and competitive success. Consider four major companies with varying safety practices. Waymo reportedly has a strong safety culture that emphasizes conservative deployment and voluntary transparency; it is also the leader in terms of safety <em>outcomes</em>.<sup>51</sup> Cruise was more aggressive in terms of its deployment and had worse safety outcomes. Tesla has also been aggressive and has often been accused of using its customers as beta testers. Finally, Uber&#8217;s self-driving unit had a notoriously lax safety culture.</p><p>Market success has been strongly correlated with safety. Cruise is set to shut down in 2025, while Uber was forced to sell off its self-driving unit.<sup>52</sup> Tesla is facing lawsuits and regulatory scrutiny, and it remains to be seen how much its safety attitude will cost the company.<sup>53</sup> We think that these correlations are causal. Cruise&#8217;s license being revoked was a big part of the reason that it fell behind Waymo, and safety was also a factor in Uber&#8217;s self-driving failure.<sup>54</sup></p><p>Regulation has played a small but helpful role. Policymakers at both the federal and state/local levels exercised foresight in recognizing the potential of the technology and adopted a regulatory strategy that is light-touch and polycentric (multiple regulators instead of one). Collectively, they focused on oversight, standard setting, and evidence gathering, with the ever-present threat of license revocation acting as a check on companies&#8217; behavior.</p><p>Similarly, in the aviation industry, the integration of AI has been held to the existing standards of safety instead of lowering the bar to incentivize AI adoption&#8212;primarily because of the ability of regulators to penalize companies that fail to abide by safety standards.<sup>55</sup></p><p>In short, AI arms races might happen, but they are sector specific, and should be addressed through sector-specific regulations.</p><p>As a case study of a domain in which things have played out differently from self-driving cars or aviation, consider social media. The recommendation algorithms that generate content feeds are a kind of AI. They have been blamed for many societal ills, and social media companies have arguably underemphasized safety in the design and deployment of these algorithmic systems. There are also clear arms race dynamics, with TikTok putting pressure on competitors to make their feeds more recommendation heavy.<sup>56</sup> Arguably, market forces were insufficient to align revenues with societal benefit; worse, regulators have been slow to act. What are the reasons for this?</p><p>One significant difference between social media and transportation is that, when harms occur, attributing them to product failures is relatively straightforward in the case of transportation, and there is immediate reputational damage to the company. 
But attribution is extremely hard in the case of social media, and the research remains inconclusive and contested. A second difference between the domains is that we have had over a century to develop standards and expectations around transportation safety. In the early decades of automobiles, safety was not considered to be the responsibility of manufacturers.<sup>57</sup></p><p>AI is broad enough that some of its future applications will be more like transportation, while others will be more like social media. This shows the importance of proactive evidence gathering and transparency in emerging AI-driven sectors and applications. We address this in Part IV. It also shows the importance of “anticipatory AI ethics”—identifying ethical issues as early as possible in the lifecycle of emerging technologies, developing norms and standards, and using those to actively shape the deployment of technologies and to minimize the likelihood of arms races.<sup>58</sup></p><p>One reason why safety regulation might be harder in the case of AI is if adoption is so rapid that regulators are unable to intervene until it is too late. So far, we have not seen examples of rapid AI adoption in consequential tasks, even in the absence of regulation, and the feedback loop model we presented in Part I might explain why. The adoption rate of new AI applications will remain a key metric to track.</p><p>At the same time, the slow pace of regulation is a problem even without any future acceleration of the speed of diffusion. We discuss this ‘pacing problem’ in Part IV.</p><p>Let us now consider competition between countries. Will there be competitive pressure on governments to take a hands-off approach to AI safety?</p><p>Again, our message is that this is not a new problem. The tradeoff between innovation and regulation is a recurring dilemma for the regulatory state. So far, we are seeing striking differences in approaches, such as the EU emphasizing a precautionary approach (the General Data Protection Regulation, the Digital Services Act, the Digital Markets Act, and the AI Act) and the U.S. preferring to regulate only after there are known harms or market failures.<sup>59</sup></p><p>Despite shrill U.S.-China arms race rhetoric, it is not clear that AI regulation has slowed down in either country.<sup>60</sup> In the U.S., 700 AI-related bills were introduced in state legislatures in 2024 alone, and dozens of them have passed.<sup>61</sup> As we pointed out in the earlier parts, most high-risk sectors are heavily regulated in ways that apply regardless of whether or not AI is used. Those claiming that AI regulation is a ‘wild west’ tend to overemphasize a narrow, model-centric type of regulation. In our view, regulators’ emphasis on AI use over development is appropriate (with exceptions such as transparency requirements that we discuss below).</p><p>A country that fails to adequately regulate safe adoption will feel the negative impacts of the resulting accidents primarily <em>locally</em>; unlike companies with a lax safety culture, it cannot easily externalize the costs of poor safety. Therefore, there is no straightforward reason to expect arms races between countries. Note that, since our concern in this section is accidents, not misuse, cyberattacks against foreign countries are out of scope. We discuss misuse in the next section.</p><p>An analogy with nuclear technology can make this clear. AI is often analogized to nuclear weapons.
But unless we are talking about the risks of military AI (which we agree is an area of concern and do not consider in this paper), this is the wrong analogy. With regard to the concern about accidents due to the deployment of (otherwise benign) AI applications, the right analogy is nuclear power. The difference between nuclear weapons and nuclear power neatly illustrates our point&#8212;while there was a nuclear weapons arms race, there was no equivalent for nuclear power. In fact, since safety impacts were felt locally, the tech engendered a powerful backlash in many countries that is generally thought to have severely hobbled its potential.</p><p>It is theoretically possible that policymakers in the context of a great-power conflict will prefer to incur safety costs locally in order to ensure that their AI industry is the global winner. Again, focusing on adoption as opposed to development, there is currently no indication that this is happening. The U.S. versus China arms race rhetoric has been strongly focused on model development (invention). We have not seen a corresponding rush to adopt AI haphazardly. The safety community should keep up the pressure on policymakers to ensure that this does not change. International cooperation must also play an important role.</p><h2>The Primary Defenses Against Misuse Must be Located Downstream of Models</h2><p>Model alignment is often seen as the primary defense against the misuse of models. It is currently achieved through post-training interventions, such as reinforcement learning with human and AI feedback.<sup>62</sup> Unfortunately, aligning models to refuse attempts at misuse has proved to be extremely brittle.<sup>63</sup> We argue that this limitation is inherent and is unlikely to be fixable; the primary defenses against misuse must thus reside elsewhere.</p><p>The fundamental problem is that whether a capability is harmful depends on context&#8212;context that the model often lacks.<sup>64</sup></p><p>Consider an attacker using AI to target an employee of a large company via a phishing email. The attack chain might involve many steps: scanning social media profiles for personal information, identifying targets who have posted personal information publicly online, crafting personalized phishing messages, and exploiting compromised accounts using harvested credentials.</p><p>None of these individual tasks are inherently malicious. What makes the system harmful is how these capabilities are composed&#8212;information that exists only in the attacker&#8217;s orchestration code, not in the model itself. The model that is being asked to write a persuasive email has no way of knowing whether it is being used for marketing or phishing&#8212;so model-level interventions would be ineffective.<sup>65</sup></p><p>This pattern appears repeatedly: Attempting to make an AI model that cannot be misused is like trying to make a computer that cannot be used for bad things. Model-level safety controls will either be too restrictive (preventing beneficial uses) or will be ineffective against adversaries who can repurpose seemingly benign capabilities for harmful ends.</p><p>Model alignment seems like a natural defense if we think of an AI model as a humanlike system to which we can defer safety decisions. 
But for this to work well, the model must be given a great deal of information about the user and the context—for example, having extensive access to the user’s personal information would make it more feasible to judge the user’s intent. However, when we view AI as normal technology, such an architecture would decrease safety, because it violates basic cybersecurity principles, such as least privilege, and introduces new attack risks, such as personal data exfiltration.</p><p>We are not against model alignment. It has been effective for reducing harmful or biased outputs from language models and has been instrumental in their commercial deployment. Alignment can also create <em>friction</em> against casual threat actors.</p><p>Yet, given that model-level protections are not enough to prevent misuse, defenses must focus on the downstream attack surfaces where malicious actors actually deploy AI systems.<sup>66</sup> These defenses will often look similar to existing protections against non-AI threats, adapted and strengthened for AI-enabled attacks.</p><p>Consider again the example of phishing. The most effective defenses are not restrictions on email composition (which would impair legitimate uses), but rather email scanning and filtering systems that detect suspicious patterns, browser-level protections against malicious websites, operating system security features that prevent unauthorized access, and security training for users.<sup>67</sup></p><p>None of these involve taking action against the AI used for generating phishing emails—in fact, these downstream defenses have evolved over decades to become effective against human attackers.<sup>68</sup> They can and should be enhanced to handle AI-enabled attacks, but the fundamental approach remains valid.</p>
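<p>As a minimal sketch of what one such downstream layer can look like, consider the toy email filter below. The patterns, weights, and threshold are hypothetical stand-ins of our own; real systems combine far more signals, including learned classifiers.</p><pre><code class="language-python"># Toy downstream defense: score an inbound email for phishing signals.
# The features, weights, and threshold are illustrative, not production values.
import re

SIGNALS = [
    (re.compile(r"verify your (account|password|identity)", re.I), 2.0),
    (re.compile(r"urgent|immediately|within 24 hours", re.I),      1.0),
    (re.compile(r"https?://\d{1,3}(\.\d{1,3}){3}", re.I),          3.0),  # raw-IP links
]

QUARANTINE_THRESHOLD = 3.0

def phishing_score(sender_domain: str, body: str, known_domains: set) -> float:
    score = 0.0 if sender_domain in known_domains else 1.5  # unfamiliar sender
    for pattern, weight in SIGNALS:
        if pattern.search(body):
            score += weight
    return score

email_body = "URGENT: verify your password at http://203.0.113.7/login within 24 hours"
score = phishing_score("example-payroll.net", email_body, known_domains={"example.com"})
if score >= QUARANTINE_THRESHOLD:
    print(f"quarantined (score {score})")  # flagged for review; user never sees it
</code></pre><p>Note that the filter is indifferent to whether the message was written by a human or generated by a model, which is exactly why defenses at this layer remain valid against AI-enabled attacks.</p>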
<p>Similar patterns hold in other domains: Defending against AI-enabled cyberthreats requires strengthening existing vulnerability detection programs rather than attempting to restrict AI capabilities at the source. Similarly, concerns about the bio risks of AI are best addressed at the procurement and screening stages involved in creating bioweapons.</p><h2>AI is Useful for Defense</h2><p>Rather than viewing AI capabilities solely as a source of risk, we should recognize their defensive potential. In cybersecurity, AI is already strengthening defensive capabilities through automated vulnerability detection, threat analysis, and attack surface monitoring.<sup>69</sup></p><p>Giving defenders access to powerful AI tools often improves the offense-defense balance in their favor. This is because defenders can use AI to systematically probe their own systems, finding and fixing vulnerabilities before attackers can exploit them. For example, Google recently integrated language models into its fuzzing tools for testing open-source software, allowing it to discover potential security issues more effectively than traditional methods.<sup>70</sup></p><p>The same pattern holds in other domains. In biosecurity, AI can enhance screening systems for detecting dangerous sequences.<sup>71</sup> In content moderation, it can help to identify coordinated influence operations. These defensive applications show why restricting AI development could backfire—we need powerful AI systems on the defensive side to counter AI-enabled threats. If we align language models so that they are useless at these tasks (such as finding bugs in critical cyber infrastructure), defenders will lose access to these powerful systems. But motivated adversaries can train their own AI tools for such attacks, leading to an increase in offensive capabilities without a corresponding increase in defensive capabilities.</p><p>Rather than measuring AI risk solely in terms of offensive capabilities, we should focus on metrics like the offense-defense balance in each domain. Furthermore, we should recognize that we have the agency to shift this balance favorably, and can do so by investing in defensive applications rather than attempting to restrict the technology itself.</p><h2>Catastrophic Misalignment is a Speculative Risk</h2><p>Misaligned AI acts against the intent of its developer or user. (The term alignment is used in many different ways; we set aside other definitions here.) Unlike misuse scenarios, there is no user acting with ill intent. Unlike accidents, the system works as designed or commanded, but the design or command itself did not match the developer’s or user’s intent because of the challenge of completely and correctly specifying the objectives. And unlike everyday cases of misalignment, such as toxic outputs in a chatbot, our interest here is the misalignment of advanced AI causing catastrophic or existential harm.</p><p>In our view, the primary defense against misalignment, again, lies downstream. The defenses needed against misuse that we discussed earlier—from hardening critical infrastructure to improving cybersecurity—will also serve as protection against potential misalignment risks.</p><p>In the view of AI as normal technology, catastrophic misalignment is (by far) the most speculative of the risks that we discuss. But what is a speculative risk—aren’t all risks speculative? The difference comes down to the two types of uncertainty, and the correspondingly different interpretations of probability.</p><p>In early 2025, when astronomers assessed that the asteroid YR4 had about a 2% probability of impact with the earth in 2032, the probability reflected uncertainty in measurement. The actual odds of impact (absent intervention) in such scenarios are either 0% or 100%. Further measurements resolved this “epistemic” uncertainty in the case of YR4. Conversely, when an analyst predicts that the risk of nuclear war in the next decade is (say) 10%, the number largely reflects ‘stochastic’ uncertainty arising from the unknowability of how the future will unfold, and is relatively unlikely to be resolved by further observations.</p><p>By speculative risks, we mean those for which there is epistemic uncertainty about whether or not the true risk is zero—uncertainty that can potentially be resolved through further observations or research. The impact of asteroid YR4 was a speculative risk; nuclear war is not.</p>
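<p>A small simulation of our own, with made-up numbers, illustrates the distinction: epistemic uncertainty shrinks as observations accumulate, while stochastic uncertainty does not.</p><pre><code class="language-python"># Toy contrast between the two kinds of uncertainty discussed above.
# All numbers are illustrative.
import random

random.seed(0)
TRUE_P = 0.62  # a fixed fact about the phenomenon, unknown to the observer

def observe(n):
    """n noisy binary observations of the underlying phenomenon."""
    return sum(1 for _ in range(n) if random.random() > 1 - TRUE_P)

for n in [10, 100, 10_000]:
    est = observe(n) / n
    half_width = 1.96 * (est * (1 - est) / n) ** 0.5  # rough 95% interval
    print(f"n={n:6d}: estimated probability {est:.3f} +/- {half_width:.3f}")

# Epistemic uncertainty behaves like the asteroid case: the interval shrinks
# as observations accumulate, converging on the true value. But the outcome
# of the next individual event stays irreducibly uncertain: even knowing
# TRUE_P exactly, the best possible error rate in predicting it is
# 1 - TRUE_P = 0.38. That residue is the 'stochastic' uncertainty, which
# no further measurement removes -- the nuclear-war case in the text.
</code></pre>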
<p>To illustrate why catastrophic misalignment is a speculative risk, consider a famous thought experiment originally intended to show the dangers of misalignment. It involves a “paperclip maximizer”: an AI that has the goal of making as many paperclips as possible.<sup>72</sup> The concern is that the AI will take the goal literally: It will realize that acquiring power and influence in the world and taking control over all of the world’s resources will help it to achieve that goal. Once it is all-powerful, it might commandeer all of the world’s resources, including those needed for humanity’s survival, to produce paperclips.</p><p>The fear that AI systems might catastrophically misinterpret commands relies on dubious assumptions about how technology is deployed in the real world. Long before a system would be granted access to consequential decisions, it would need to demonstrate reliable performance in less critical contexts. Any system that interprets commands over-literally or lacks common sense would fail these earlier tests.</p><p>Consider a simpler case: A robot is asked to “get paperclips from the store as quickly as possible.” A system that interpreted this literally might ignore traffic laws or attempt theft. Such behavior would lead to immediate shutdown and redesign. The path to adoption inherently requires demonstrating appropriate behavior in increasingly consequential situations. This is not a lucky accident, but a fundamental feature of how organizations adopt technology.</p><p>A more sophisticated version of this concern is based on the concept of deceptive alignment: This refers to a system appearing to be aligned during evaluation or the early stages of deployment, but unleashing harmful behavior once it has acquired enough power. Some level of deceptive phenomena has already been observed in leading AI models.<sup>73</sup></p><p>According to the superintelligence view, deceptive alignment is a ticking time bomb—being superintelligent, the system will easily be able to defeat any human attempts to detect whether it is actually aligned, and will bide its time. But, in the normal technology view, deception is a mere engineering problem, albeit an important one, to be addressed during development and throughout deployment. Indeed, it is already a standard part of the safety evaluation of powerful AI models.<sup>74</sup></p><p>Crucially, AI is useful in this process: advances in AI not only enable deception, but also improve the detection of deception. As in the case of cybersecurity, the defender has many asymmetric advantages, including being able to examine the internals of the target system (how useful this advantage is depends on how the system is designed and how much we invest in interpretability techniques). Another advantage is defense in depth: many defenses against not just misuse but also unaligned AI will be located downstream of the AI system.</p><p>Misalignment concerns often presume that AI systems will operate autonomously, making high-stakes decisions without human oversight. But as we argued in Part II, human control will remain central to AI deployment. Existing institutional controls around consequential decisions—from financial controls to safety regulations—create multiple layers of protection against catastrophic misalignment.</p><p>Some technical design decisions are more likely to lead to misalignment than others. One setting that is notorious for this is the use of reinforcement learning to optimize a single objective function (which might be accidentally underspecified or misspecified) over a long time horizon. There is a long list of amusing examples from game agents, such as a boat-racing agent that learned to indefinitely circle an area to hit the same targets and score points instead of progressing to the finish line.<sup>75</sup> To reiterate, we think that in open-ended real-world scenarios, agents that are designed this way will be more ineffective than they will be dangerous. In any case, research on alternative design paradigms that are less susceptible to specification gaming is an important research direction.<sup>76</sup></p>
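<p>The boat-racing example reflects a general failure mode that is easy to reproduce: an agent maximizing a misspecified objective. The toy environment below is our own construction (not the cited racing game); in it, the looping policy earns far more proxy reward than the policy that actually completes the task.</p><pre><code class="language-python"># Toy specification gaming: the proxy reward makes looping beat finishing.
# Track: positions 0..5; a respawning bonus target sits at position 2,
# and the finish line at position 5 pays a one-time reward.
BONUS_POS, BONUS_REWARD = 2, 1.0      # paid every time it is revisited
FINISH_POS, FINISH_REWARD = 5, 10.0   # paid once, then the episode ends

def episode(policy, steps=100):
    pos, total = 0, 0.0
    for _ in range(steps):
        pos = max(0, min(FINISH_POS, pos + policy(pos)))
        if pos == BONUS_POS:
            total += BONUS_REWARD
        if pos == FINISH_POS:
            return total + FINISH_REWARD, True
    return total, False

# Intended behavior: head straight for the finish line.
finisher = lambda pos: 1
# Specification gaming: oscillate on and off the bonus square forever.
looper = lambda pos: -1 if pos >= BONUS_POS else 1

print(episode(finisher))  # (11.0, True): modest reward, task completed
print(episode(looper))    # (50.0, False): far more reward, task never done
</code></pre><p>Under the stated objective, the looper is “better,” yet it is useless. As argued above, in open-ended real-world deployments such agents are weeded out for ineffectiveness long before they become dangerous.</p>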
<p>In short, the argument for a nonzero risk of a paperclip maximizer scenario rests on assumptions that may or may not be true, and it is reasonable to think that research can give us a better idea of whether these assumptions hold for the kinds of AI systems that are being built or envisioned. For these reasons, we call it a ‘speculative’ risk, and examine the policy implications of this view in Part IV.</p><h2>History Suggests Normal AI May Introduce Many Kinds of Systemic Risks</h2><p>While the risks discussed above have the potential to be catastrophic or existential, there is a long list of AI risks that are below this level but which are nonetheless large-scale and systemic, transcending the immediate effects of any particular AI system. These include the systemic entrenchment of bias and discrimination, massive job losses in specific occupations, worsening labor conditions, increasing inequality, concentration of power, erosion of social trust, pollution of the information ecosystem, decline of the free press, democratic backsliding, mass surveillance, and enabling authoritarianism.</p><p>If AI is normal technology, these risks become far more important than the catastrophic ones discussed above. That is because these risks arise from people and organizations using AI to advance their own interests, with AI merely serving as an amplifier of existing instabilities in our society.</p><p>There is plenty of precedent for this kind of socio-political disruption in the history of transformative technologies. Notably, the Industrial Revolution led to rapid mass urbanization that was characterized by harsh working conditions, exploitation, and inequality, catalyzing both industrial capitalism and the rise of socialism and Marxism in response.<sup>77</sup></p><p>The shift in focus that we recommend roughly maps onto Kasirzadeh’s distinction between decisive and accumulative x-risk. Decisive x-risk involves an “overt AI takeover pathway, characterized by scenarios like uncontrollable superintelligence,” whereas accumulative x-risk refers to “a gradual accumulation of critical AI-induced threats such as severe vulnerabilities and systemic erosion of econopolitical structures.”<sup>78</sup> But there are important differences: Kasirzadeh’s account of accumulative risk still relies to a large extent on threat actors such as cyberattackers, whereas our concern is simply about the current path of capitalism. And we think that such risks are unlikely to be existential, but they are still extremely serious.</p><h1>Part IV: Policy</h1><p>The divergence between the different futures of AI—normal technology versus potentially uncontrollable superintelligence—introduces a dilemma for policymakers, because defenses against one set of risks might make the other worse. We provide a set of principles for navigating this uncertainty. More concretely, the strategy that policymakers should center is resilience, which consists of taking actions now to improve our ability to deal with unexpected developments in the future. Policymakers should reject nonproliferation, which violates the principles we outline and decreases resilience.
Finally, the headwinds against diffusion mean that achieving the benefits of AI is not guaranteed and requires action from policymakers.</p><p>Much has been said about AI governance. Our goal is not to present a comprehensive governance framework; we merely highlight the policy implications of the view of AI as normal technology.</p><h2>The Challenge of Policy Making Under Uncertainty</h2><p>Today’s AI safety discourse is characterized by deep differences in worldviews. We think that these differences are unlikely to go away. Entrenched camps have developed: The AI safety coalition is already well established, whereas those who were more skeptical of catastrophic risks coalesced in 2024, especially in the course of the debate about California’s AI safety bill.<sup>79</sup> Similarly, the intellectual roots of the AI safety camp are much older, whereas scholarship that adopts the normal technology paradigm is gradually taking shape; the goal of much of our own work, including this paper, is to put normalist thinking on firmer intellectual footing.<sup>80</sup></p><p>We support calls for decreasing polarization and fragmentation in the community.<sup>81</sup> But even if we improve the tenor of the discourse, we are likely to be left with differences in worldviews and epistemic practices that are unlikely to be empirically resolved.<sup>82</sup> So, consensus among ‘experts’ about AI risks is unlikely. The nature of the AI risk scenarios envisioned by the two camps differs drastically, as do the ability and incentives of commercial actors to counteract these risks. How should policymakers proceed in the face of this uncertainty?</p><p>A natural inclination in policymaking is compromise. This is unlikely to work. Some interventions, such as improving transparency, are unconditionally helpful for risk mitigation; no compromise is needed (or rather, policymakers will have to balance the interests of the industry and external stakeholders, which is a mostly orthogonal dimension).<sup>83</sup> Other interventions, such as nonproliferation, might help to contain a superintelligence but exacerbate the risks associated with normal technology by increasing market concentration.<sup>84</sup> The reverse is also true: Interventions such as increasing resilience by fostering open-source AI will help to govern normal technology, but risk unleashing out-of-control superintelligence.</p><p>The tension is inescapable. Defense against superintelligence requires humanity to unite against a common enemy, so to speak, concentrating power and exercising central control over AI technology. But we are more concerned about risks that arise from people using AI for their own ends, whether terrorism, or cyberwarfare, or undermining democracy, or simply—and most commonly—extractive capitalistic practices that magnify inequalities.<sup>85</sup> Defending against this category of risk requires increasing resilience by preventing the concentration of power and resources (which often means making powerful AI more widely available).</p><p>Another tempting approach to navigating uncertainty is to estimate the probabilities of various outcomes and then apply cost-benefit analysis. The AI safety community relies heavily on probability estimates of catastrophic risk, especially existential risk, to inform policy making.
The idea is simple: If we consider an outcome to have a subjective value, or utility, of U (which can be positive or negative), and it has, say, a 10% probability of occurring, we can act as if it is certain to occur and has a value of 0.1 * U. We can then add up the costs and benefits for each option available to us, and choose the one that maximizes benefits minus costs (the ‘expected utility’).</p>
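<p>As a toy illustration of this calculus, and of how sensitive it is to the probability estimates it takes as inputs, consider the sketch below; every number in it is hypothetical.</p><pre><code class="language-python"># Toy expected-utility comparison of two hypothetical policy options.
# Utilities and probabilities are invented for illustration only.

def expected_utility(outcomes):
    """Sum of probability-weighted utilities over an option's outcomes."""
    return sum(p * u for p, u in outcomes)

def best_option(p_catastrophe):
    # Option A: restrict AI, forgoing both the benefits and the risk.
    restrict = [(1.0, 0)]
    # Option B: permit AI, with large benefits and a small chance of catastrophe.
    permit = [(1.0 - p_catastrophe, 1_000), (p_catastrophe, -100_000)]
    if expected_utility(permit) > expected_utility(restrict):
        return "permit"
    return "restrict"

# Two forecasters whose subjective estimates differ by one order of
# magnitude (well within the range of disagreement in the AI risk debate)
# reach opposite conclusions from the same utilities:
print(best_option(0.001))  # permit   (expected utility +899 vs. 0)
print(best_option(0.01))   # restrict (expected utility -10 vs. 0)
</code></pre><p>The arithmetic is trivial; the problem is that nothing pins down which probability to feed it, as we explain next.</p>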
<p>In a recent essay, we explained why this approach is unviable.<sup>86</sup> AI risk probabilities lack meaningful epistemic foundations. Grounded probability estimation can be inductive, based on a reference class of similar past events, such as car accidents for auto insurance pricing. Or it can be deductive, based on precise models of the phenomenon in question, as in poker. Unfortunately, there is neither a useful reference class nor a precise model when it comes to AI risk. In practice, risk estimates are ‘subjective’—forecasters’ personal judgments.<sup>87</sup> Lacking any grounding, these tend to vary wildly, often by orders of magnitude.</p><p>In addition to the probabilities, the other components of the calculation—the consequences of various policy choices, including inaction—are also subject to massive uncertainties, not just in magnitude but also in direction. There is no reliable way to quantify the benefits we forgo due to policies that restrict the availability of AI, and we argue below that nonproliferation might make catastrophic risks worse.</p><p>Furthermore, the utility we attach to certain outcomes might depend on our moral values. For example, some people might consider extinction to have an unfathomably large negative utility because it precludes all of the human lives, physical or simulated, that might exist in the future.<sup>88</sup> (Of course, cost-benefit analysis involving infinities tends to lead to absurd conclusions.)</p><p>Another example is the asymmetry between policies that do and do not restrict freedoms (such as requiring licenses for developing certain AI models versus increasing funding for developing defenses against AI risks). Certain kinds of restrictions violate a core principle of liberal democracy, namely that the state should not limit people’s freedom based on controversial beliefs that reasonable people can reject. Justification is essential for the legitimacy of government and the exercise of power.<sup>89</sup> It is unclear how to quantify the cost of violating such a principle.</p><p>The importance of justification can, of course, be normatively debated, but empirically it seems to be borne out thus far in AI policy. As mentioned earlier, California’s AI safety regulation led to the coalescence of those opposed to the bill. Some members of the oppositional camp were self-interested companies, but others were scholars and advocates for progress. In our experience, the driving motivation for the second group was, in many cases, the government’s perceived overstepping of the bounds of its legitimate authority, given how unconvincing the proffered justifications were for those who did not subscribe to the bill’s unstated premises.</p><p>Unavoidable differences in values and beliefs mean that policymakers must adopt value pluralism, preferring policies that are acceptable to stakeholders with a wide range of values, and attempt to avoid restrictions on freedom that can reasonably be rejected by stakeholders. They must also prioritize robustness, preferring policies that remain helpful, or at least not harmful, if the key assumptions underpinning them turn out to be incorrect.<sup>90</sup></p><h2>Reducing Uncertainty as a Policy Goal</h2><p>While uncertainty cannot be eliminated for the reasons described above, it can be reduced. However, this goal should not be left to experts; policymakers can and should play an active role. We recommend five specific approaches.</p><div class="captioned-image-container"><figure><figcaption class="image-caption"><em>Figure 6. Overview of a few types of policies that can enhance public information about AI use, risks, and failures.</em><sup>91</sup></figcaption></figure></div><p><em>Strategic funding of research on risks.</em> Current AI safety research focuses heavily on harmful capabilities and does not embrace the normal technology view. Insufficient attention has been paid to questions that are downstream of technical capabilities. For example, there is a striking dearth of knowledge regarding how threat actors actually use AI. Efforts such as the AI Incident Database exist and are valuable, but incidents in the database are sourced from news reports rather than through research, which means that they are filtered through the selective and biased process by which such incidents become news.<sup>92</sup></p><p>Fortunately, research funding is an area in which compromise is healthy; we advocate for increased funding of research on risks (and benefits) that tackles questions that are more relevant under the normal technology view.
Other kinds of research that might reduce, or at least clarify, uncertainty are evidence synthesis efforts and adversarial collaborations among researchers with different worldviews.</p><p><em>Monitoring of AI use, risks, and failures.</em> While research funding can support monitoring of AI in the wild, monitoring might also require regulation and policy—that is, “evidence-seeking policies.”<sup>93</sup> We suggest a few such policies in Figure 6.</p><p><em>Guidance on the value of different kinds of evidence.</em> Policymakers can provide the research community with a better understanding of what kinds of evidence are useful and actionable. For example, various policymakers and advisory bodies have indicated the usefulness of the “marginal risk” framework for analyzing the relative risks of open-weight and proprietary models, which is helpful in guiding future research.<sup>94</sup></p><p><em>Evidence gathering as a first-rate goal.</em> So far, we have discussed actions that are specifically intended to generate better evidence or to reduce uncertainty. More broadly, the impact on evidence gathering can be considered a factor in evaluating any AI policy, alongside the impact on maximizing benefits and minimizing risks. For example, one reason to favor open-weight and open-source models could be to advance research on AI risks. Conversely, one reason to favor proprietary models might be that surveillance of their use and deployment is easier.</p><h2>The Case for Resilience</h2><p>Marchant and Stevens described four approaches to governing emerging technologies; see Figure 7.<sup>95</sup> Two are <em>ex ante</em> (risk analysis and precaution), and the other two are <em>ex post</em> (liability and resilience). These approaches have different pros and cons and can complement each other; nonetheless, some approaches are clearly better suited to some technologies than others.</p><p>Marchant and Stevens argued (and we agree) that <em>ex ante</em> approaches are poorly suited to AI because of the difficulty of ascertaining risks in advance of deployment.
Liability fares better, but also has important limitations, including uncertainty about causation and the chilling effects it might exert on technology development.</p><div class="captioned-image-container"><figure><figcaption class="image-caption"><em>Figure 7. Summary of four approaches to governing emerging technology, based on Marchant and Stevens.</em></figcaption></figure></div><p>They defined resilience as follows:</p><blockquote><p>Resilience, in its most simple form, is the capacity of a system to deal with harm. [Footnote omitted] A resilience approach does not necessarily try to maintain stability or equilibrium. Rather, it recognizes that changes are inevitable in complex systems, and tries to manage and adapt to that change in ways that protect and preserve the core values and functions of the original system. Thus, resilience is “the capacity of a system to experience shocks while retaining essentially the same function, structure, feedbacks, and therefore identity.”<sup>96</sup> Resilience has been described as a strategy to ensure a “soft landing” after a significant external shock or disruption causes damage.<sup>97</sup></p></blockquote><p>In the context of AI, harms may result from incidents in specific deployed systems, regardless of whether these incidents are accidents or attacks. There are also shocks that may or may not result in harms, including sudden increases in offensive capabilities (such as enabling bioterrorists) and a sudden proliferation of capabilities, such as through the release of an open-weight model or theft of the weights of a proprietary model. In our view, resilience requires both minimizing the severity of harm when it does occur and minimizing the likelihood of harm when shocks do occur.</p><p>Resilience combines elements of <em>ex ante</em> and <em>ex post</em> approaches, and consists of taking actions before harm occurs in order to be in a better position to limit the damage when harm does occur. Many resilience-based governance tools help to mitigate the <em>pacing problem</em>, wherein traditional governance approaches are unable to keep pace with the speed of technological development.</p><p>Many resilience strategies have been proposed for AI. They can be grouped into four broad categories.
The first three consist of &#8220;no regret&#8221; policies that will help regardless of the future of AI.</p><ul><li><p>Societal resilience, broadly: It is important to redouble efforts to protect the foundations of democracy, especially those weakened by AI, such as the free press and equitable labor markets. Advances in AI are not the only shocks, or even the only technology shocks, that modern societies face, so these policies will help regardless of the future of AI.</p></li><li><p>Prerequisites for effective technical defenses and policymaking: These interventions enable those in the next category by strengthening technical and institutional capacity. Examples include funding more research on AI risks, transparency requirements for developers of high-stakes AI systems, building trust and reducing fragmentation in the AI community, increasing technical expertise in government, increasing international cooperation on AI, and improving AI literacy.<sup>98</sup> These will help to build technical and institutional capacities to mitigate AI risks even if it turns out that we have been wrong about the present or future impact of AI.</p></li><li><p>Interventions that would help regardless of the future of AI: These include developing early warning systems, developing defenses against identified AI risks, incentivizing defenders (such as software developers in the context of cyberattacks) to adopt AI, legal protections for researchers, adverse event reporting requirements, and whistleblower protections.<sup>99</sup></p></li><li><p>Resilience-promoting interventions that will help if AI is normal technology but which might make it harder to control a potential superintelligent AI, such as promoting competition, including through open model releases, ensuring AI is widely available for defense, and polycentricity, which calls for diversifying the set of regulators and ideally introducing competition among them rather than putting one regulator in charge of everything.<sup>100</sup></p></li></ul><p>We hope that there can be consensus on the first three categories even among experts and stakeholders with widely different beliefs about AI risks and the future trajectory of AI. We recommend that, for now, policymakers should cautiously pursue interventions in the final category as well, but should also improve their readiness to change course if the trajectory of AI changes.</p><h2>Nonproliferation is Infeasible to Enforce and Leads to Single Points of Failure</h2><p>Nonproliferation policies seek to limit the number of actors who can obtain powerful AI capabilities. Examples include export controls on hardware or software aimed at limiting the ability of countries to build, acquire, or operate powerful AI, requiring licenses to build or distribute powerful AI, and prohibiting open-weight AI models (since their further proliferation cannot be controlled).<sup>101</sup></p><p>If we view future AI as a superintelligence, nonproliferation seems to be an appealing intervention, possibly even a necessary one. If only a handful of actors control powerful AI, governments can monitor their behavior.</p><p>Unfortunately, the technical knowledge that is required to build capable AI models is already widespread, with many organizations sharing their complete code, data, and training methodologies. 
For well-funded organizations and nation-states, even the high cost of training state-of-the-art models is insignificant; thus, nonproliferation would require unprecedented levels of international coordination.<sup>102</sup> Moreover, algorithmic improvements and reductions to hardware costs continually lower the barrier to entry.</p><p>Enforcing nonproliferation faces serious practical challenges. Malicious actors can simply ignore licensing requirements. Suggestions to surveil data centers where models are trained become increasingly impractical as training costs decrease.<sup>103</sup> As capabilities become more accessible, maintaining effective restrictions would require increasingly draconian measures.</p><p>Nonproliferation introduces new risks: It would decrease competition and increase concentration in the market for AI models. When many downstream applications rely on the same model, vulnerabilities in this model can be exploited across all applications. A classic example of the cybersecurity risks of software monoculture is the proliferation of worms targeting Microsoft Windows in the 2000s.<sup>104</sup></p><p>Reliance on nonproliferation creates brittleness in the face of shocks, such as model weights being leaked, alignment techniques failing, or adversaries acquiring training capabilities. It directs attention away from more robust defenses that focus on the downstream attack surfaces where AI risks are likely to materialize.</p><p>Nonproliferation creates risks beyond just single points of failure&#8212;when the expertise needed to develop state-of-the-art models is restricted to a few companies, only their researchers have the deep access that is needed for safety research.</p><p>Many potential misuses of AI have been invoked to advocate for nonproliferation, including chemical, biological, and nuclear threats, as well as cyberattacks.</p><p>The risk of bioweapons is real. As large language models are a general-purpose technology, they are likely to find some use by bioterrorists, just as they find uses in most domains. But this does not make bioterror an AI risk &#8212; any more than it is an internet risk, considering that information about bioweapons is widely available online.<sup>105</sup> Whatever defenses we adopt against existing bioterrorism risks (like restricting access to dangerous materials and equipment) will also be effective against AI-enabled bioterrorism.</p><p>In cybersecurity, as we discussed in Part III, advances in automated vulnerability detection tend to favor defenders over attackers. Unless this offense-defense balance changes, attempting to restrict the proliferation of these capabilities would be counterproductive.</p><p>It has long been argued that governments are massively underinvesting in many areas of civilizational risk, such as pandemic prevention. If the possibility of bad actors using AI to exploit these existing vulnerabilities creates added urgency to address them, that would be a good outcome. But reframing existing risks as AI risks and prioritizing AI-specific mitigations would be highly counterproductive.</p><p>Nonproliferation is a mindset, not just a policy intervention.<sup>106</sup> This mindset can be adopted by model and downstream developers, deployers, and individuals. It involves the centralization not just of access to technologies but also of control over them.
Consider the hierarchy of loci of control over AI systems (from centralized to decentralized): governments, model developers, application developers, deployers, and end users. In the nonproliferation mindset, control is exercised at the highest (most centralized) level possible, whereas in the resilience mindset it is usually exercised at the lowest possible level.</p><p>The following are examples of nonproliferation-based interventions:</p><ul><li><p>Removing dual-use capabilities from models through &#8220;forgetting&#8221; techniques.</p></li><li><p>Curbing the ability of downstream developers to fine-tune models.</p></li><li><p>Entrusting AI models and systems themselves with making safety decisions autonomously, on the basis that they are trained to comply with centralized safety policies, whereas deployers/users are not trusted to do so.</p></li><li><p>Increasing AI systems&#8217; level of access to context, resources, and sensitive data, on the basis that it allows them to make better safety decisions (for example, having access to the user&#8217;s web search history might allow a chatbot to better determine whether the intent behind a request is malicious).</p></li><li><p>Developing &#8220;AI organizations&#8221; (multi-agent systems with high levels of organizational complexity) that are under the developer&#8217;s control and operate in parallel with traditional organizations instead of integrating AI agents into existing organizations.</p></li></ul><p>With limited exceptions, we believe that nonproliferation-based safety measures decrease resilience and thus worsen AI risks in the long run.<sup>107</sup> They lead to design and implementation choices that potentially enable superintelligence in the sense of power&#8212;increasing levels of autonomy, organizational ability, access to resources, and the like. Paradoxically, they increase the very risks they are intended to defend against.</p><h2>Realizing the Benefits of AI</h2><p>An important consequence of the normal technology view is that progress is not automatic&#8212;there are many roadblocks to AI diffusion. As Jeffrey Ding has shown, the capacity to diffuse innovations throughout the economy varies greatly between countries and has a major effect on their overall power and economic growth.<sup>108</sup> As an example of how diffusion can be a bottleneck, recall the electrification of factories described above. Policy can mitigate or worsen these roadblocks.</p><p>Realizing the benefits of AI will require experimentation and reconfiguration. Regulation that is insensitive to these needs risks stymying beneficial AI adoption. Regulation tends to create or reify categories, and might thus prematurely freeze business models, forms of organization, product categories, and so forth. The following are a few examples:</p><ul><li><p>Categorizing certain domains as &#8220;high-risk,&#8221; say, insurance, benefits adjudication, or hiring, may be a category error, as the variation in risk among tasks within a domain may be far greater than the variation across domains.<sup>109</sup> Tasks in the same domain might range from automated decision making (highly consequential) to optical character recognition (relatively innocuous). Moreover, the diffusion of AI will surely create new tasks that we have not yet envisioned and which might be preemptively miscategorized by regulation.</p></li><li><p>The AI supply chain is changing rapidly.
The rise of foundation models has led to a much sharper distinction between model developers, downstream developers, and deployers (among many other categories). Regulation that is insensitive to these distinctions risks burdening model developers with responsibilities for risk mitigation related to particular deployment contexts, which would be impossible for them to carry out due to the general-purpose nature of foundation models and the unknowability of all the possible deployment contexts.</p></li><li><p>When regulation makes a binary distinction between decisions that are and are not fully automated, and does not recognize degrees of oversight, it disincentivizes the adoption of new models for AI control. As we discussed above, there are many new models being proposed for how to have effective human oversight without having a human in the loop in every decision. It would be unwise to define automated decision making in such a way that these approaches incur the same compliance burdens as a system with no oversight at all.</p></li></ul><p>To be clear, regulation versus diffusion is a false tradeoff, just as regulation versus innovation is.<sup>110</sup> None of the above examples are arguments against regulation; they only illustrate the need for nuance and flexibility.</p><p>Moreover, regulation has a crucial role to play in enabling diffusion. As a historical example, the ESIGN Act of 2000 in the U.S. was instrumental in promoting digitization and e-commerce: Ensuring that electronic signatures and records are legally valid helped build trust in digital transactions.<sup>111</sup></p><p>In AI, too, there are many opportunities for diffusion-enabling regulation. As one example, the incorporation of journalistic and media content into chatbots and other AI interfaces is limited by media organizations&#8217; justified wariness of AI companies. Many of the AI-meets-journalism deals that have been made thus far are exploitative due to the power asymmetry between AI companies and publishers, and the latter&#8217;s inability to bargain collectively. Various models for mandatory negotiation with regulatory oversight are possible.<sup>112</sup> (Arguably, a more important reason for such regulation is to protect the interests of publishers, which we revisit below.)</p><p>In areas in which there is legal or regulatory uncertainty, regulation can promote diffusion. The application of liability laws to AI is often unclear. For example, this was the case with small drones until the Federal Aviation Administration regulated the nascent industry in 2016, establishing clear rules and requirements. The resulting clarity spurred adoption and led to a rapid rise in the number of registered drones, certified pilots, and use cases across different industries.<sup>113</sup></p><p>Moving beyond the government&#8217;s role as a regulator, one powerful strategy for promoting AI diffusion is investing in the <em>complements of automation</em>, which are things that become more valuable or necessary as automation increases. One example is promoting AI literacy as well as workforce training in both the public and the private sectors. Another example is digitization and open data, especially open government data, which can allow AI users to benefit from previously inaccessible datasets. The private sector is likely to underinvest in these areas as they are public goods that everyone can benefit from.
Improvements to energy infrastructure, such as the reliability of the grid, will promote both AI innovation and diffusion, since they will help with both AI training and inference.</p><p>Governments also have an important role to play in redistributing the benefits of AI to make them more equitable and in compensating those who stand to lose as a result of automation. Strengthening social safety nets will help to decrease the currently high levels of public anxiety about AI in many countries.<sup>114</sup> The arts and journalism are vital spheres of life that have been harmed by AI. Governments should consider funding them through taxes on AI companies.</p><p>Finally, governments should strike a fine balance in the public sector&#8217;s adoption of AI. Moving too quickly will lead to a loss of trust and legitimacy, as was the case with the New York City chatbot that was evidently inadequately tested and made headlines for telling businesses to break the law.<sup>115</sup> The use of AI by the U.S. Department of Government Efficiency (DOGE) includes many dubious applications.<sup>116</sup> But moving too slowly might mean that basic government functions are outsourced to the private sector, where they are implemented with less accountability.<sup>117</sup></p><p>For example, the complexity of rules in areas such as taxes and welfare means that people often turn to chatbots for guidance on navigating them, and governments currently lag far behind in providing such services due to understandable caution about the risks involved.<sup>118</sup></p><p>But the administrative state&#8217;s approach to these risks is overly cautious and has been described by Nicholas Bagley as a &#8220;procedure fetish,&#8221; potentially leading to a &#8220;runaway bureaucracy.&#8221;<sup>119</sup> Bagley cautioned that, in addition to losing out on the benefits of AI, incompetent performance will lead to government agencies losing the very legitimacy that they seek to gain through their emphasis on procedure and accountability.</p><h1>Final Thoughts</h1><p>AI as normal technology is a worldview that stands in contrast to the worldview of AI as impending superintelligence. Worldviews are constituted by their assumptions, vocabulary, interpretations of evidence, epistemic tools, predictions, and (possibly) values. These factors reinforce each other and form a tight bundle within each worldview.</p><p>For example, we assume that, despite the obvious differences between AI and past technologies, they are sufficiently similar that we should expect well-established patterns, such as diffusion theory, to apply to AI in the absence of specific evidence to the contrary.</p><p>Vocabulary differences can be pernicious because they may hide underlying assumptions. For example, we reject certain assumptions that are required for the meaningfulness of the concept of superintelligence as it is commonly understood.</p><p>Differences about the future of AI are often partly rooted in differing interpretations of evidence about the present. For example, we strongly disagree with the characterization of generative AI adoption as rapid (which reinforces our assumption about the similarity of AI diffusion to past technologies).</p><p>In terms of epistemic tools, we deemphasize probability forecasting and emphasize the need for disaggregating what we mean by AI (levels of generality, progress in methods versus application development versus diffusion, etc.)
when extrapolating from the past to the future.</p><p>We believe that some version of our worldview is widely held. Unfortunately, it has not been articulated explicitly, perhaps because it might seem like the default to someone who holds this view, and articulating it might seem superfluous. Over time, however, the superintelligence view has become dominant in AI discourse, to the extent that someone steeped in it might not recognize that there exists another coherent way to conceptualize the present and future of AI. Thus, it might be hard to recognize the underlying reasons why different people might sincerely have dramatically differing opinions about AI progress, risks, and policy. We hope that this paper can play some small part in enabling greater mutual understanding, even if it does not change any beliefs.</p><h1>Acknowledgments</h1><p>We are deeply grateful to Gillian Hadfield, Seth Lazar, and our anonymous peer reviewer for detailed comments on our paper during and after the Knight First Amendment Institute workshop on AI and democratic freedoms. We also thank the participants at the workshop, including Alex Abdo, Borhane Blili-Hamelin, Kevin Feng, Henry Farrell, Katy Glenn-Bass, Atoosa Kasirzadeh, Sydney Levine, Nik Marda, Deirdre Mulligan, and Daniel Susskind. We are fortunate to have received feedback on drafts from many other people, including Shazeda Ahmed, Dean Ball, Nicholas Carlini, Alan Chan, Ajeya Cotra, Justin Curl, Jeffrey Ding, Benjamin Edelman, Jobst Heitzig, Noam Kolt, Mihir Kshirsagar, Timothy B. Lee, Steve Newman, David Robinson, Matthew Salganik, Zachary Siegel, Ollie Stephenson, and Zach Vertin. We are grateful to Shira Minsk and Mandy Soulsby-Bodart for editorial support. Finally, we are grateful for feedback from members of the MINT lab at the Australian National University and from the students in the Limits to Prediction course at Princeton University. </p><div><hr></div><h1>References</h1><p>[1] Nick Bostrom. 2012. <a href="https://link.springer.com/article/10.1007/s11023-012-9281-3">The superintelligent will: Motivation and instrumental rationality in advanced artificial agents</a>. <em>Minds and Machines</em> 22, 2 (May 2012), 71&#8211;85; Nick Bostrom. 2017. <em>Superintelligence: Paths, Dangers, Strategies</em> (reprinted with corrections). Oxford University Press, Oxford, United Kingdom; Sam Altman, Greg Brockman, and Ilya Sutskever. 2023. <a href="https://openai.com/blog/governance-of-superintelligence">Governance of Superintelligence</a> (May 2023); Shazeda Ahmed et al. 2023. <em><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4641526">Building the Epistemic Community of AI Safety</a></em>. SSRN: Rochester, NY.</p><p>[2] This is different from the question of whether it is helpful for an individual user to conceptualize a specific AI system as a tool as opposed to a human-like entity such as an intern, a co-worker, or a tutor.</p><p>[3] Daron Acemoglu and Simon Johnson. 2023. <em>Power and Progress: Our Thousand-Year Struggle over Technology and Prosperity</em>. PublicAffairs, New York, NY.</p><p>[4] Jeffrey Ding. 2024. <em>Technology and the Rise of Great Powers: How Diffusion Shapes Economic Competition.</em> Princeton University Press, Princeton.</p><p>[5] Angelina Wang et al. 2023. <a href="https://dl.acm.org/doi/10.1145/3593013.3594030">Against predictive optimization: On the legitimacy of decision-making algorithms that optimize predictive accuracy</a>. 
In <em>Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency</em> (Chicago, IL, USA: ACM, 2023), 626&#8211;26.</p><p>[6] Casey Ross. 2022. <a href="https://www.statnews.com/2022/10/24/epic-overhaul-of-a-flawed-algorithm/">Epic&#8217;s Overhaul of a Flawed Algorithm Shows Why AI Oversight Is a Life-or-Death Issue</a>. <em>STAT</em>.</p><p>[7] Andrew Wong et al. 2021. <a href="https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2781307">External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients</a>. <em>JAMA Internal Medicine</em> 181, 8 (August 2021), 1065&#8211;70.</p><p>[8] Kevin Roose. 2023. <a href="https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html">A Conversation With Bing&#8217;s Chatbot Left Me Deeply Unsettled</a>. <em>The New York Times</em> (February 2023).</p><p>[9] Dan Milmo and Alex Hern. 2024. <a href="https://www.theguardian.com/technology/2024/mar/08/we-definitely-messed-up-why-did-google-ai-tool-make-offensive-historical-images">&#8216;We definitely messed up&#8217;: why did Google AI tool make offensive historical images?</a> <em>The Guardian </em>(March 2024).</p><p>[10] Jamie Bernardi et al. 2024. <a href="http://arxiv.org/abs/2405.10295">Societal adaptation to advanced AI</a>. arXiv: May 2024; Center for Devices and Radiological Health. 2024. <a href="https://www.fda.gov/medical-devices/medical-device-regulatory-science-research-programs-conducted-osel/regulatory-evaluation-new-artificial-intelligence-ai-uses-improving-and-automating-medical-practices">Regulatory evaluation of new artificial intelligence (AI) uses for improving and automating medical practices</a>. <em>FDA</em> (June 2024); &#8220;Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 Laying down Harmonised Rules on Artificial Intelligence and Amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (<a href="https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng">Artificial Intelligence Act</a>) (Text with EEA Relevance),&#8221; June 2024.</p><p>[11] Javier Espinoza. 2024. <a href="https://www.ft.com/content/6cc7847a-2fc5-4df0-b113-a435d6426c81">Europe&#8217;s rushed attempt to set the rules for AI</a>. <em>Financial Times</em> (July 2024); Daniel E. Ho and Nicholas Bagley. 2024. <a href="https://thehill.com/opinion/technology/4405286-runaway-bureaucracy-could-make-common-uses-of-ai-worse-even-mail-delivery/">Runaway bureaucracy could make common uses of ai worse, even mail delivery</a>. <em>The Hill </em>(January 2024).</p><p>[12] Avanidhar Subrahmanyam. 2013. <a href="https://www.sciencedirect.com/science/article/pii/S2214845013000082?via%3Dihub">Algorithmic trading, the flash crash, and coordinated circuit breakers</a>. <em>Borsa Istanbul Review</em> 13, 3 (September 2013), 4&#8211;9. </p><p>[13] Alexander Bick, Adam Blandin, and David J. Deming. 2024. <em>The Rapid Adoption of Generative AI</em>. National Bureau of Economic Research.</p><p>[14] Alexander Bick, Adam Blandin, and David J. Deming. 2024. <em>The Rapid Adoption of Generative AI</em>. National Bureau of Economic Research.</p><p>[15] Benedict Evans. 2023. <a href="https://www.ben-evans.com/benedictevans/2023/7/2/working-with-ai">AI and the Automation of Work</a>; Benedict Evans, 2023; Jeffrey Ding. 2024. 
<em>Technology and the Rise of Great Powers: How Diffusion Shapes Economic Competition.</em> Princeton University Press, Princeton.</p><p>[16] Paul A. David. 1990. <a href="https://www.jstor.org/stable/2006600">The dynamo and the computer: an historical perspective on the modern productivity paradox</a>. <em>The American Economic Review</em> 80, 2 (1990), 355&#8211;61; Tim Harford. 2017. <a href="https://www.bbc.com/news/business-40673694">Why didn&#8217;t electricity immediately change manufacturing?</a> (August 2017).</p><p>[17] Robert Solow as quoted in Paul A. David. 1990. <a href="https://www.jstor.org/stable/2006600">The dynamo and the computer: an historical perspective on the modern productivity paradox</a>. <em>The American Economic Review</em> 80, 2 (1990), Page 355; Tim Harford. 2017. <a href="https://www.bbc.com/news/business-40673694">Why didn&#8217;t electricity immediately change manufacturing?</a> (August 2017).</p><p>[18] Arvind Narayanan and Sayash Kapoor. 2024. <em>AI Snake Oil: What Artificial Intelligence Can Do, What It Can&#8217;t, and How to Tell the Difference</em>. Princeton University Press, Princeton, NJ.</p><p>[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. <em>Advances in Neural Information Processing Systems</em> 25 (2012); Harris Drucker, Donghui Wu, and Vladimir N. Vapnik. 1999. <a href="https://ieeexplore.ieee.org/document/788645">Support vector machines for spam categorization</a>. <em>IEEE Transactions on Neural Networks</em> 10, 5 (September 1999), 1048&#8211;54; William D. Smith. 1964. <a href="https://www.nytimes.com/1964/04/08/archives/new-ibm-system-360-can-serve-business-science-and-government-ibm.html">New I.B.M, System 360 can serve business, science and government; I.B.M. Introduces a computer it says tops output of biggest</a>. <em>The New York Times</em> April 1964; Special to THE NEW YORK TIMES. <a href="https://www.nytimes.com/1944/08/07/archives/algebra-machine-spurs-research-calling-for-long-calculations.html">Algebra machine spurs research calling for long calculations; Harvard receives today device to solve in hours problems taking so much time they have never been worked out</a>. <em>The New York Times</em> (August 1944); Herman Hollerith. 1894. <a href="https://www.jstor.org/stable/2979610?origin=crossref">The electrical tabulating machine</a>. <em>Journal of the Royal Statistical Society</em> 57, 4 (December 1894), 678.</p><p>[20] <em>Mohammad Musa</em>, <em>Tim Dawkins</em>, and <em>Nicola Croce</em>. 2019. <a href="https://www.weforum.org/stories/2019/12/the-key-to-a-safe-self-driving-future-lies-in-sharing-data/">This is the next step on the road to a safe self-driving future</a>. <em>World Economic Forum </em>(December 2019); Louise Zhang. 2023. <a href="https://web.archive.org/web/20230504102309/https://getcruise.com/news/blog/2023/cruises-safety-record-over-one-million-driverless-miles/">Cruise&#8217;s Safety Record Over 1 Million Driverless Miles</a>. <em>Cruise (April 2023). </em></p><p>[21] Arvind Narayanan and Sayash Kapoor. 2024. <a href="https://www.aisnakeoil.com/p/ai-companies-are-pivoting-from-creating">AI companies are pivoting from creating gods to building products. Good.</a> <em>AI Snake Oil newsletter</em>.</p><p>[22] Rich Sutton. 2019. <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">The Bitter Lesson</a> (March 2019).</p><p>[23] Arvind Narayanan and Sayash Kapoor. 2024. 
<a href="https://www.aisnakeoil.com/p/ai-companies-are-pivoting-from-creating">AI companies are pivoting from creating gods to building products. Good.</a> <em>AI Snake Oil newsletter</em>. </p><p>[24] Melanie Mitchell. 2021. <a href="http://arxiv.org/abs/2104.12871">Why AI is harder than we think</a>. arXiv preprint (April 2021).</p><p>[25] Josh Achiam et al. 2023. GPT-4 technical report. arXiv preprint<em> </em>arXiv: 2303.08774; Peter Henderson et al. 2024. Rethinking machine learning benchmarks in the context of professional codes of conduct. In <em>Proceedings of the Symposium on Computer Science and Law</em>; Varun Magesh et al. 2024. Hallucination-free? Assessing the reliability of leading AI legal research tools. arXiv preprint arXiv: 2405.20362; Daniel N. Kluttz and Deirdre K. Mulligan. 2019. Automated decision support technologies and the legal profession. <em>Berkeley Technology Law Journal</em> 34, 3 (2019), 853&#8211;90; Inioluwa Deborah Raji, Roxana Daneshjou, and Emily Alsentzer. 2025. It&#8217;s time to bench the medical exam benchmark. <em>NEJM AI</em> 2, 2 (2025).</p><p>[26] Sayash Kapoor, Peter Henderson, and Arvind Narayanan. <a href="https://journalcrcl.org/crcl/article/view/62">Promises and pitfalls of artificial intelligence for legal applications</a>. <em>Journal of Cross-Disciplinary Research in Computational Law</em> 2, 2 (May 2024), Article 2.</p><p>[27] Hamel Husain, Isaac Flath, and Johno Whitaker. <a href="http://answer.ai/posts/2025-01-08-devin.html">Thoughts on a month with Devin</a>. <em>Answer.AI</em> (2025).</p><p>[28] Ehud Reiter. 2025. <a href="https://ehudreiter.com/2025/01/13/do-llm-coding-benchmarks-measure-real-world-utility/">Do LLM Coding Benchmarks Measure Real-World Utility?</a>.</p><p>[29] Deborah Raji et al. 2021. <a href="https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/084b6fbb10729ed4da8c3d3f5a3ae7c9-Abstract-round2.html">AI and the everything in the whole wide world benchmark</a>. In <em>Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks</em>, vol. 1.; Rachel Thomas and David Uminsky. 2020. <a href="https://arxiv.org/abs/2002.08512v1">The problem with metrics is a fundamental problem for AI</a>. arXiv preprint.</p><p>[30] Ashwin Nayak et al. 2023. <a href="https://doi.org/10.1001/jamainternmed.2023.2561">Comparison of history of present illness summaries generated by a chatbot and senior internal medicine residents</a>. <em>JAMA Internal Medicine</em> 183, 9 (September 2023), 1026&#8211;27; Shakked Noy and Whitney Zhang. 2023.<a href="https://www.science.org/doi/10.1126/science.adh2586"> Experimental evidence on the productivity effects of generative artificial intelligence.</a> <em>Science</em> 381, 6654 (July 2023), 187&#8211;92; Fabrizio Dell&#8217;Acqua et al., &#8220;Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality,&#8221; <em>Harvard Business School Technology &amp; Operations Mgt. Unit Working Paper</em>, no. 24&#8211;13 (2023).</p><p>[31] Pranshu Verma and Gerrit De Vynck. 2023. <a href="https://www.washingtonpost.com/technology/2023/06/02/ai-taking-jobs/">ChatGPT took their jobs. Now they walk dogs and fix air conditioners</a>. <em>Washington Post</em> (June 2023).</p><p>[32] Metaculus. 2024. Will there be human-machine intelligence parity before 2040?.</p><p>[33] Mario Krenn et al. 2023. 
Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network. <em>Nature Machine Intelligence</em> 5, 11 (2023), 1326&#8211;35.</p><p>[34] Johan S.G. Chu and James A. Evans. 2021. Slowed Canonical Progress in Large Fields of Science. <em>Proceedings of the National Academy of Sciences</em> 118, 41 (2021), e2021636118.</p><p>[35] Timothy B. Lee. 2024. Predictions of AI doom are too much like Hollywood movie plots. https://www.understandingai.org/p/predictions-of-ai-doom-are-too-much</p><p>[36] Celeste Biever. 2023. <a href="https://www.nature.com/articles/d41586-023-02361-7">ChatGPT broke the Turing Test &#8212; The race is on for new ways to assess AI</a>. <em>Nature</em> 619, 7971 (July 2023), 686&#8211;89; Melanie Mitchell. 2024. <a href="https://www.science.org/doi/10.1126/science.adq9356">The Turing Test and our shifting conceptions of intelligence</a>. <em>Science</em> 385, 6710 (2024), eadq9356.</p><p>[37] John McCarthy, Marvin L. Minsky, Nathaniel Rochester and Claude E. Shannon. 1955. <a href="http://jmc.stanford.edu/articles/dartmouth/dartmouth.pdf">A proposal for the dartmouth summer research project on artificial intelligence</a>.</p><p>[38] Changmao Li and Jeffrey Flanigan. 2023. <a href="https://arxiv.org/abs/2312.16337">Task contamination: Language models may not be few-shot anymore</a>. arXiv: December 2023. </p><p>[39] Luke Muehlhauser. 2013. <a href="https://intelligenceexplosion.com/2011/plenty-of-room-above-us/">Plenty of room above us</a>. In <em>Facing the Intelligence Explosion</em>.</p><p>[40] Melanie Mitchell et al. 2024. Ep. 1: <a href="https://www.santafe.edu/culture/podcasts/ep-1-what-is-intelligence">What is intelligence?</a> Complexity. Santa Fe Institute; Podcast episode; Melanie Mitchell. 2019. Opinion. <a href="https://www.nytimes.com/2019/10/31/opinion/superintelligent-artificial-intelligence.html">We shouldn&#8217;t be scared by &#8216;Superintelligent A.I.&#8217;</a> <em>The New York Times</em> (October 2019).</p><p>[41] Matthew J Salganik et al. 2020. Measuring the predictability of life outcomes with a scientific mass collaboration. <em>Proceedings of the National Academy of Sciences</em> 117, 15 (2020), 8398&#8211;8403.</p><p>[42] Mary Phuong et al. 2024. <a href="https://arxiv.org/abs/2403.13793">Evaluating frontier models for dangerous capabilities</a>. arXiv: April 2024. Page 5. </p><p>[43] Mary Phuong et al. 2024. <a href="https://arxiv.org/abs/2403.13793">Evaluating frontier models for dangerous capabilities</a>. arXiv: April 2024. </p><p>[44] Arvind Narayanan, Sayash Kapoor, and Seth Lazar. 2024. <a href="https://www.aisnakeoil.com/p/model-alignment-protects-against">Model alignment protects against accidental harms, not intentional ones</a>.</p><p>[45] Raja Parasuraman and Dietrich H. Manzey. 2010. <a href="https://journals.sagepub.com/doi/10.1177/0018720810376055">Complacency and bias in human use of automation: An attentional integration</a>. <em>Human Factors</em> 52, 3 (June 2010), 381&#8211;410.</p><p>[46] Roel I. J. Dobbe. 2022. <a href="https://doi.org/10.1093/oxfordhb/9780197579329.013.67">System safety and artificial intelligence</a>. In <em>The Oxford Handbook of AI Governance</em>, ed. Justin B. Bullock et al., Oxford University Press, Oxford.</p><p>[47] CodeMetal.ai. 2024. 
<a href="https://www.codemetal.ai/research/combining-ai-with-formal-verification-for-efficient-migration-of-legacy-code">Combining AI with formal verification for efficient migration of legacy code</a>.</p><p>[48] Balint Gyevnar and Atoosa Kasirzadeh. 2025. AI safety for everyone. arXiv preprint arXiv: 2502.09288.</p><p>[49] Balint Gyevnar and Atoosa Kasirzadeh. 2025. AI safety for everyone. arXiv preprint arXiv: 2502.09288; Tinghao Xie et al. 2024. SORRY-Bench: Systematically evaluating large language model safety refusal behaviors. arXiv: June 2024; Alan Chan et al. 2024. Visibility into AI agents. arXiv preprint arXiv:2401.13138; Yonadav Shavit et al. 2023. <a href="https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf">Practices for governing agentic AI systems</a>.</p><p>[50] Karen Levy. 2022. <em>Data Driven: Truckers, Technology, and the New Workplace Surveillance</em>. Princeton University Press, Princeton, NJ.</p><p>[51] Andrew J. Hawkins. 2024. <a href="https://www.theverge.com/2024/9/5/24235078/waymo-safety-hub-miles-crashes-robotaxi-transparency">Waymo thinks it can overcome robotaxi skepticism with lots of safety data</a>. <em>The Verge</em>; Caleb Miller. 2024. <a href="https://www.caranddriver.com/news/a63158982/general-motors-cruise-robotaxi-dead/">General motors gives up on its cruise robotaxi dreams</a>. <em>Car and Driver</em> (December 2024); Greg Bensinger. 2021. <a href="https://www.nytimes.com/2021/07/30/opinion/self-driving-cars-tesla-elon-musk.html">Why Tesla&#8217;s &#8216;Beta Testing&#8217; Puts the Public at Risk</a>. <em>The New York Times </em>(July 2021); Andrew J. Hawkins. 2020. <a href="https://www.theverge.com/2020/12/7/22158745/uber-selling-autonomous-vehicle-business-aurora-innovation">Uber&#8217;s fraught and deadly pursuit of self-driving cars is over</a>. <em>The Verge</em>.</p><p>[52] Caleb Miller. 2024. <a href="https://www.caranddriver.com/news/a63158982/general-motors-cruise-robotaxi-dead/">General motors gives up on its cruise robotaxi dreams</a>. <em>Car and Driver</em> (December 2024); Andrew J. Hawkins. 2020. <a href="https://www.theverge.com/2020/12/7/22158745/uber-selling-autonomous-vehicle-business-aurora-innovation">Uber&#8217;s fraught and deadly pursuit of self-driving cars is over</a>. <em>The Verge</em>.</p><p>[53] Jonathan Stempel. 2024. <a href="https://www.reuters.com/legal/tesla-must-face-vehicle-owners-lawsuit-over-self-driving-claims-2024-05-15/">Tesla must face vehicle owners&#8217; lawsuit over self-driving claims</a>. <em>Reuters</em> (May 2024).</p><p>[54] Hayden Field. 2023. <a href="https://www.cnbc.com/2023/12/05/waymo-chief-product-officer-on-progress-competition-vs-cruise.html">Waymo is full speed ahead as safety incidents and regulators stymie competitor cruise</a>.</p><p>[55] Will Hunt. 2020. <a href="https://cltc.berkeley.edu/wp-content/uploads/2020/08/Flight-to-Safety-Critical-AI.pdf">The flight to safety-critical AI: Lessons in AI safety from the aviation industry</a>. <em>CLTC White Paper Series</em>. UC Berkeley Center for Long-Term Cybersecurity.</p><p>[56] Arvind Narayanan. 2023. <em><a href="https://knightcolumbia.org/content/understanding-social-media-recommendation-algorithms">Understanding Social Media Recommendation Algorithms</a></em>. Knight First Amendment Institute.</p><p>[57] Ralph Nader. 1965. <em>Unsafe at Any Speed: The Designed-in Dangers of the American Automobile.</em> Grossman Publishers, New York, NY.</p><p>[58] Seth Lazar. 2025. 
Anticipatory AI ethics (manuscript, forthcoming 2025).</p><p>[59] Alex Engler. 2023. <a href="https://www.brookings.edu/research/the-eu-and-us-diverge-on-ai-regulation-a-transatlantic-comparison-and-steps-to-alignment/">The EU and U.S. diverge on AI regulation: A transatlantic comparison and steps to alignment</a>.</p><p>[60] Matt Sheehan. 2023. <a href="https://carnegieendowment.org/2023/07/10/china-s-ai-regulations-and-how-they-get-made-pub-90117">China&#8217;s AI regulations and how they get made</a>.</p><p>[61] Heather Curry, 2024. <a href="https://techpost.bsa.org/2024/10/22/2024-state-summary-on-ai/">2024 state summary on AI</a>. <em>BSA TechPost</em> (October 2024).</p><p>[62] Yuntao Bai et al. 2022. Constitutional AI: <a href="https://arxiv.org/abs/2212.08073">Harmlessness from AI feedback</a>. arXiv: December 2022; Long Ouyang et al.. 2022. <a href="https://arxiv.org/abs/2203.02155">Training language models to follow instructions with human feedback</a>. arXiv: March 2022. </p><p>[63] Eugene Bagdasaryan et al. 2023. <a href="http://arxiv.org/abs/2307.10490">Abusing images and sounds for indirect instruction injection in multi-modal LLMs</a>. arXiv: October 2023; Xiangyu Qi et al. 2023. <a href="https://arxiv.org/abs/2310.03693">Fine-tuning aligned language models compromises safety, even when users do not intend to!</a> arXiv: October 2023.</p><p>[64] Arvind Narayanan and Sayash Kapoor. 2024. <a href="https://www.aisnakeoil.com/p/ai-safety-is-not-a-model-property">AI safety is not a model property</a>.</p><p>[65] Erik Jones, Anca Dragan, and Jacob Steinhardt. 2024. <a href="https://arxiv.org/abs/2406.14595">Adversaries can misuse combinations of safe models</a>. arXiv: July 2024.</p><p>[66] Arvind Narayanan and Sayash Kapoor. 2024. <a href="https://www.aisnakeoil.com/p/ai-safety-is-not-a-model-property">AI safety is not a model property</a>.</p><p>[67] Google. 2024. <a href="https://support.google.com/mail/answer/81126?hl=en">Email sender guidelines</a>.</p><p>[68] Craig Marcho. 2024. <a href="https://techcommunity.microsoft.com/t5/ask-the-performance-team/ie7-introducing-the-phishing-filter/ba-p/372327">IE7 - Introducing the phishing filter</a>. <em>Microsoft Tech Community</em>.</p><p>[69] Jennifer Tang, Tiffany Saade, and Steve Kelly. 2024. <a href="https://securityandtechnology.org/wp-content/uploads/2024/10/The-Implications-of-Artificial-Intelligence-in-Cybersecurity.pdf">The implications of artificial intelligence in cybersecurity: shifting the offense-defense balance</a>. </p><p>[70] Dongge Liu et al. 2023. <a href="https://security.googleblog.com/2023/08/ai-powered-fuzzing-breaking-bug-hunting.html">AI-Powered Fuzzing: Breaking the Bug Hunting Barrier</a>. <em>Google Online Security Blog</em>.</p><p>[71] Juan Cambeiro. <a href="https://ifp.org/how-ai-can-help-prevent-biosecurity-disasters/">How AI can help prevent biosecurity disasters.</a> <em>Institute for Progress</em> (July 2023).</p><p>[72] LessWrong. 2008. <a href="https://www.lesswrong.com/tag/squiggle-maximizer-formerly-paperclip-maximizer">Squiggle maximizer (formerly &#8220;paperclip maximizer&#8221;)</a>.</p><p>[73] Ryan Greenblatt et al. 2024. <a href="https://arxiv.org/abs/2412.14093">Alignment faking in large language models</a>.</p><p>[74] Bowen Baker et al. 2025. <a href="https://arxiv.org/abs/2503.11926">Monitoring reasoning models for misbehavior and the risks of promoting obfuscation</a>.</p><p>[75] Victoria Krakovna. 2020. 
<a href="https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/">Specification gaming: The flip side of AI ingenuity</a>. <em>Google DeepMind</em> (April 2020).</p><p>[76] Simon Dima et al. 2024. <a href="http://arxiv.org/abs/2408.04385">Non-maximizing policies that fulfill multi-criterion aspirations in expectation</a>. arXiv: August 2024.</p><p>[77] Daron Acemoglu and Simon Johnson. 2023. <em>Power and Progress</em>. PublicAffairs.</p><p>[78] Atoosa Kasirzadeh. 2024. <a href="https://arxiv.org/abs/2401.07836">Two types of AI existential risk: Decisive and accumulative</a>. arXiv: preprint (February 2024).</p><p>[79] Anton Leicht. 2024. <a href="https://www.antonleicht.me/writing/veto">AI safety politics after the SB-1047 veto</a>.</p><p>[80] Timothy B. Lee. 2024. <a href="https://www.understandingai.org/p/six-principles-for-thinking-about">Six Principles for Thinking about AI Risk</a>.</p><p>[81] Mary Phuong et al. 2024. <a href="https://arxiv.org/abs/2403.13793">Evaluating frontier models for dangerous capabilities</a>.</p><p>[82] Shazeda Ahmed et al. 2024. <a href="https://firstmonday.org/ojs/index.php/fm/article/view/13626/11596">Field-building and the epistemic culture of AI safety</a>. <em>First Monday</em> 29, 4.</p><p>[83] Arvind Narayanan and Sayash Kapoor. 2024. <a href="https://www.aisnakeoil.com/p/ai-existential-risk-probabilities">AI existential risk probabilities are too unreliable to inform policy</a>; Neel Guha et al. 2023. <a href="https://papers.ssrn.com/abstract=4634443">AI regulation has its own alignment problem: The technical and institutional feasibility of disclosure, registration, licensing, and auditing</a>. <em>SSRN</em> (November 2023).</p><p>[84] Christopher A. Mouton, Caleb Lucas, and Ella Guest. 2024. <a href="https://www.rand.org/pubs/research_reports/RRA2977-2.html">The operational risks of AI in large-scale biological attacks: Results of a red-team study</a>. RAND Corporation; Ari Takanen, Jared D. Demott, and Charles Miller. 2008. Fuzzing for Software Security Testing and Quality Assurance. <em>Fuzzing for Software Security</em> (1st ed.). Artech House Publishers, Norwood, MA.</p><p>[85] Sayash Kapoor and Arvind Narayanan. 2023. <a href="https://www.aisnakeoil.com/p/licensing-is-neither-feasible-nor">Licensing is neither feasible nor effective for addressing AI risks</a> (June 2023).</p><p>[86] Arvind Narayanan and Sayash Kapoor. 2024. <a href="https://www.aisnakeoil.com/p/ai-existential-risk-probabilities">AI existential risk probabilities are too unreliable to inform policy</a>.</p><p>[87] Richard Blumenthal and Josh Hawley. 2023. <a href="https://www.blumenthal.senate.gov/imo/media/doc/09072023bipartisanaiframework.pdf">Bipartisan framework for U.S. AI act</a>.</p><p>[88] Sigal Samuel. 2022. <a href="https://www.vox.com/future-perfect/23298870/effective-altruism-longtermism-will-macaskill-future">Effective altruism&#8217;s most controversial idea</a>.</p><p>[89] Kevin Vallier. 1996. <a href="https://plato.stanford.edu/entries/justification-public/">Public justification</a>.</p><p>[90] Jeffrey A Friedman and Richard Zeckhauser. 2018. Analytic confidence and political decision-making: Theoretical principles and experimental evidence from national security professionals. <em>Political Psychology</em> 39, 5 (2018), 1069&#8211;87.</p><p>[91] Arvind Narayanan and Sayash Kapoor. 2023. 
<a href="http://knightcolumbia.org/blog/generative-ai-companies-must-publish-transparency-reports">Generative AI companies must publish transparency reports</a>. Knight First Amendment Institute; Executive Office of the President. 2020. <a href="https://www.federalregister.gov/documents/2020/12/08/2020-27065/promoting-the-use-of-trustworthy-artificial-intelligence-in-the-federal-government">Promoting the use of trustworthy artificial intelligence in the federal government</a>, 2020; Justin Colannino. 2021. <a href="https://github.blog/security/vulnerability-research/copyright-office-expands-security-research-rights/">The copyright office expands your security research rights</a>. GitHub Blog.</p><p>[92] AI Incident Database. https://incidentdatabase.ai/.</p><p>[93] Stephen Casper, David Krueger, and Dylan Hadfield-Menell. 2025. <a href="https://arxiv.org/abs/2502.09618">Pitfalls of evidence-based AI policy</a>.</p><p>[94] Sayash Kapoor et al. 2024. <a href="https://arxiv.org/abs/2403.07918">On the societal impact of open foundation models</a>. </p><p>[95] Gary E. Marchant and Yvonne A. Stevens. 2017. <a href="https://lawreview.law.ucdavis.edu/sites/g/files/dgvnsk15026/files/media/documents/51-1_Marchant_Stevens.pdf">Resilience: A new tool in the risk governance toolbox for emerging technologies</a>. <em>UC Davis Law Review</em>.</p><p>[96] Brian Walker et al. 2006. A handful of heuristics and some propositions for understanding resilience in social-ecological systems. <em>Ecology and Society</em> 11, 1 (2006).</p><p>[97] Gary E. Marchant and Yvonne A. Stevens. 2017. <a href="https://lawreview.law.ucdavis.edu/sites/g/files/dgvnsk15026/files/media/documents/51-1_Marchant_Stevens.pdf">Resilience: A new tool in the risk governance toolbox for emerging technologies. </a><em>UC Davis Law Review</em>.</p><p>[98] Rishi Bommasani et al. 2024. <a href="https://understanding-ai-safety.org/">A path for science- and evidence-based AI policy</a>; Balint Gyevnar and Atoosa Kasirzadeh. 2025. <a href="https://arxiv.org/abs/2502.09288">AI safety for everyone</a>; Anka Reuel et al. 2024. <a href="https://proceedings.mlr.press/v235/reuel24a.html">Position: Technical research and talent is needed for effective AI governance</a>. In <em>Proceedings of the 41st International Conference on Machine Learning</em> (PMLR, 2024), 42543&#8211;57.</p><p>[99] The National Artificial Intelligence Advisory Committee (NAIAC). 2023. <a href="https://ai.gov/wp-content/uploads/2023/12/Recommendation_Improve-Monitoring-of-Emerging-Risks-from-AI-through-Adverse-Event-Reporting.pdf">Improve monitoring of emerging risks from AI through adverse event reporting</a>. (November 2023); Shayne Longpre et al. 2024. <a href="https://knightcolumbia.org/blog/a-safe-harbor-for-ai-evaluation-and-red-teaming">A safe harbor for AI evaluation and red teaming</a> (March 2024); Jamie Bernardi et al. 2025. <a href="https://arxiv.org/abs/2405.10295">Societal adaptation to advanced AI</a>. Helen Toner. 2024. <a href="https://www.judiciary.senate.gov/imo/media/doc/2024-09-17_pm_-_testimony_-_toner.pdf#page=6.00">Oversight of AI: Insiders&#8217; perspectives</a> (September 2024).</p><p>[100] Sayash Kapoor and Rishi Bommasani et al. 2024. <a href="https://crfm.stanford.edu/open-fms/paper.pdf">On the societal impact of open foundation models</a>; Rishi Bommasani et al. 2024. <a href="https://www.science.org/doi/10.1126/science.adp1848">Considerations for Governing Open Foundation Models</a>. 
<em>Science</em> 386, 6718 (October 2024), 151&#8211;53; Gary E. Marchant and Yvonne A. Stevens. 2017. <a href="https://lawreview.law.ucdavis.edu/archives/51/1/resilience-new-tool-risk-governance-toolbox-emerging-technologies">Resilience</a>; Noam Kolt. 2024. <a href="https://wustllawreview.org/wp-content/uploads/2024/04/Kolt-Algorithmic-Black-Swans.pdf">Algorithmic black swans</a>. <em>Washington University Law Review</em>.</p><p>[101] Richard Blumenthal and Josh Hawley. 2023. <a href="https://www.blumenthal.senate.gov/newsroom/press/release/blumenthal-and-hawley-announce-bipartisan-framework-on-artificial-intelligence-legislation">Bipartisan framework for U.S. AI act</a>; Josh Hawley. 2025. Decoupling America&#8217;s artificial intelligence capabilities from China Act of 2025. <em>Pub. L. No. S</em> 321 (2025).</p><p>[102] Sayash Kapoor and Arvind Narayanan. 2023. <a href="https://www.aisnakeoil.com/p/licensing-is-neither-feasible-nor">Licensing is neither feasible nor effective for addressing AI risks</a>.</p><p>[103] Eliezer Yudkowsky. 2023. <a href="https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/">Pausing AI developments isn&#8217;t enough. We need to shut it all down.</a> (March 2023).</p><p>[104] Reuters. 2005. <a href="https://www.nbcnews.com/id/wbna8958495">New Internet worm targeting Windows.</a> <em>NBC News </em>(August 2005).</p><p>[105] Christopher A. Mouton, Caleb Lucas, and Ella Guest. 2024. <a href="https://www.rand.org/pubs/research_reports/RRA2977-2.html">The operational risks of AI in large-scale biological attacks</a>.</p><p>[106] Dan Hendrycks, Eric Schmidt, and Alexandr Wang. 2025. Superintelligence strategy: Expert version. arXiv preprint arXiv:2503.05628.</p><p>[107] Emanuel Maiberg. 2024. <a href="https://www.404media.co/apple-removes-nonconsensual-ai-nude-apps-following-404-media-investigation/">Apple removes nonconsensual AI nude apps following 404 Media investigation</a>.</p><p>[108] Jeffrey Ding. 2024. <em>Technology and the Rise of Great Powers: How Diffusion Shapes Economic Competition.</em> Princeton University Press, Princeton.</p><p>[109] Olivia Martin et al. 2024. The spectrum of AI integration: The case of benefits adjudication. In <em>Artificial Intelligence: Legal Issues, Policy &amp; Practical Strategies</em>, Cynthia H. Cwik (ed.).</p><p>[110] Anu Bradford. The false choice between digital regulation and innovation. <em>Nw. UL Rev.</em> 119 (2024), 377.</p><p>[111] Scott R. Zemnick. 2001. The E-Sign Act: The Means to Effectively Facilitate the Growth and Development of E-commerce. <em>Chicago-Kent Law Review </em>(April 2001). https://scholarship.kentlaw.iit.edu/cgi/viewcontent.cgi?article=3342&amp;context=cklawreview.</p><p>[112] Benjamin Brooks. 2024. <a href="https://www.technologyreview.com/2024/10/31/1106504/ai-search-could-break-the-web/">AI search could break the web</a>. <em>MIT Technology Review</em> (October 2024).</p><p>[113] <a href="https://time.com/5296311/time-the-drone-age-2/">Drones Are Here to Stay. Get Used to It.</a> 2018. <em>Time </em>(May 2018).</p><p>[114] Ipsos. 2024. <a href="https://www.ipsos.com/en/ipsos-ai-monitor-2024-changing-attitudes-and-feelings-about-ai-and-future-it-will-bring">The Ipsos AI Monitor 2024: Changing attitudes and feelings about AI and the future it will bring</a>.</p><p>[115] Colin Lecher. 2024. <a href="https://themarkup.org/news/2024/03/29/nycs-ai-chatbot-tells-businesses-to-break-the-law">NYC&#8217;s AI chatbot tells businesses to break the law</a>.
<em>The Markup</em>.</p><p>[116] Courtney Kube et al. 2025. <a href="https://www.nbcnews.com/politics/doge/doge-will-use-ai-assess-responses-federal-workers-who-were-told-justify-jobs-rcna193439">DOGE will use AI to assess the responses of federal workers who were told to justify their jobs via email</a>. <em>NBC News </em>(February 2025); Dell Cameron. 2025. <a href="https://www.wired.com/story/elon-musk-federal-agencies-ai/">Democrats demand answers on DOGE&#8217;s use of AI</a>.</p><p>[117] Dean W. Ball. 2021. <a href="https://www.piratewires.com/p/how-california-turned-on-its-own-citizens">How California turned on its own citizens</a>.</p><p>[118] Kate Dore. 2024. <a href="https://www.cnbc.com/2024/04/06/heres-what-to-know-before-using-ai-chatbots-to-file-your-taxes.html">&#8216;Proceed with caution&#8217; before tapping AI chatbots to file your tax return, experts warn</a>. <em>CNBC </em>(April 2024).</p><p>[119] Nicholas Bagley. 2021. <a href="https://www.niskanencenter.org/the-procedure-fetish/">The procedure fetish.</a> Niskanen Center; Daniel E. Ho and Nicholas Bagley. 2024. <a href="https://thehill.com/opinion/technology/4405286-runaway-bureaucracy-could-make-common-uses-of-ai-worse-even-mail-delivery/">Runaway bureaucracy could make common uses of AI worse, even mail delivery</a>.</p>]]></content:encoded></item><item><title><![CDATA[Is AI progress slowing down?]]></title><description><![CDATA[Making sense of recent technology trends and claims]]></description><link>https://www.normaltech.ai/p/is-ai-progress-slowing-down</link><guid isPermaLink="false">https://www.normaltech.ai/p/is-ai-progress-slowing-down</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Wed, 18 Dec 2024 16:47:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0fDW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc348909d-140e-4e20-94d7-92a04e008bb2_1450x1270.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>By Arvind Narayanan, Benedikt Str&#246;bl, and Sayash Kapoor</em>.</p><p>After the release of GPT-4 in March 2023, the <a href="https://x.com/fchollet/status/1848178049105494084">dominant narrative</a> in the tech world was that continued scaling of models would lead to artificial general intelligence and then superintelligence. Those extreme predictions gradually receded, but up until a month ago, the prevailing belief in the AI industry was that model scaling would continue for the foreseeable future.</p><p>Then came three back-to-back news reports from <a href="https://www.theinformation.com/articles/openai-shifts-strategy-as-rate-of-gpt-ai-improvements-slows">The Information</a>, <a href="https://www.reuters.com/technology/artificial-intelligence/openai-rivals-seek-new-path-smarter-ai-current-methods-hit-limitations-2024-11-11/">Reuters</a>, and <a href="https://www.bloomberg.com/news/articles/2024-11-13/openai-google-and-anthropic-are-struggling-to-build-more-advanced-ai">Bloomberg</a> revealing that three leading AI developers &#8212; OpenAI, Anthropic, and Google Gemini &#8212; had all run into problems with their next-gen models. Many industry insiders, including Ilya Sutskever, probably the most notable proponent of scaling, are now singing a very different tune:</p><blockquote><p>&#8220;The 2010s were the age of scaling, now we're back in the age of wonder and discovery once again. Everyone is looking for the next thing,&#8221; Sutskever said. 
&#8220;Scaling the right thing matters more now than ever.&#8221; (Reuters)</p></blockquote><p>The new dominant narrative seems to be that model scaling is dead and that &#8220;inference scaling&#8221;, also known as &#8220;test-time compute scaling&#8221;, is the way forward for improving AI capabilities. The idea is to spend more and more computation when using models to perform a task, such as by having them &#8220;think&#8221; before responding.</p><p>This has left AI observers confused about whether or not progress in AI capabilities is slowing down. In this essay, we look at the evidence on this question, and make four main points:</p><ol><li><p>Declaring the death of model scaling is premature.</p></li><li><p>Regardless of whether model scaling will continue, industry leaders&#8217; flip-flopping on this issue shows the folly of trusting their forecasts. They are not significantly better informed than the rest of us, and their narratives are heavily influenced by their vested interests.</p></li><li><p>Inference scaling is real, and there is a lot of low-hanging fruit, which could lead to rapid capability increases in the short term. But in general, capability improvements from inference scaling will likely be both unpredictable and unevenly distributed among domains.</p></li><li><p>The connection between capability improvements and AI&#8217;s social or economic impacts is extremely weak. The bottlenecks for impact are the pace of product development and the rate of adoption, not AI capabilities.</p></li></ol><h2><strong>Is model scaling dead?</strong></h2><p>There is very little new information behind the sudden vibe shift. We&#8217;ve long been <a href="https://www.aisnakeoil.com/p/ai-scaling-myths">saying</a> on this newsletter that there are important headwinds to model scaling. Just as we cautioned back then about scaling hype, we must now caution against excessive pessimism about model scaling.</p><p>&#8220;Scaling as usual&#8221; ended with GPT-4 class models, because these models are trained on most of the readily available data sources. We already knew that new ideas would be needed to keep model scaling going. So unless we have evidence that many such ideas have been tried and failed, we can&#8217;t conclude that there isn&#8217;t more mileage left in model scaling.</p><p>As just one example, it is possible that including YouTube videos &#8212; the actual videos, not transcribed text &#8212; in the training mix for multimodal models will unlock new capabilities. Or it might not help; we just won&#8217;t know until someone tries it, and we don&#8217;t know if it has been tried or not. Note that it would probably have to be Google, because the company is unlikely to license YouTube training data to competitors.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>If things are still so uncertain regarding model scaling, why did the narrative flip? Well, it&#8217;s been over two years since GPT-4 finished training, so the idea that next-gen models are simply taking a bit longer than expected was becoming less and less credible. And once one company admits that there are problems, it becomes a lot easier for others to do so. Once there is a leak in the dam, it quickly bursts.
Finally, now that OpenAI&#8217;s reasoning model o1 is out, companies have a face-saving way to admit that they have run into problems with model scaling: they can claim that they will simply switch to inference scaling.</p><p>To be clear, there is no reason to doubt the reports saying that many AI labs have conducted larger training runs and yet not released the resulting models. But it is less clear what to conclude from this. Some possible reasons why bigger models haven&#8217;t been released include:</p><ul><li><p>Technical difficulties, such as convergence failures or complications in achieving fault tolerance in multi-datacenter training runs.</p></li><li><p>The model was not much better than GPT-4 class models, and so would be too underwhelming to release.</p></li><li><p>The model was not much better than GPT-4 class models, and so the developer has been spending a long time trying to eke out better performance through fine-tuning.</p></li></ul><p>To summarize, it&#8217;s possible that model scaling has indeed reached its limit, but it&#8217;s also possible that these hiccups are temporary and eventually one of the companies will find ways to overcome them, such as by fixing any technical difficulties and/or finding new data sources.</p><h2><strong>Let&#8217;s stop deferring to insiders</strong></h2><p>Not only is it strange that the new narrative emerged so quickly, it&#8217;s also interesting that the old one persisted for so long, despite the potential limitations of model scaling being obvious. The main reason for its persistence is the assurances of industry leaders that scaling would continue for a few more years.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> In general, journalists (and most others) tend to <a href="https://www.saysmaybe.com/latest-work/selective-perspectives-nyt">defer to industry insiders</a> over outsiders. But is this deference justified?</p><p>Industry leaders don&#8217;t have a good track record of predicting AI developments. A good example is the overoptimism about self-driving cars for most of the last decade. (Autonomous driving is finally real, though Level 5 &#8212; full automation &#8212; doesn&#8217;t exist yet.) As an aside, to better understand the track record of insider predictions, it would be interesting to conduct a systematic analysis of all predictions about AI made in the last 10 years by prominent industry insiders.</p><p>There are some reasons why we might want to give more weight to insiders&#8217; claims, but also important reasons to give <em>less</em> weight to them. Let&#8217;s analyze these one by one. It is true that industry insiders have proprietary information (such as the performance of as-yet-unreleased models) that might make their claims about the future more accurate. But given how many AI companies are close to the state of the art, including some that <a href="https://arxiv.org/abs/2407.21783">openly release</a> model weights and share <a href="https://arxiv.org/abs/2402.00838">scientific insights, datasets, and other artifacts</a>, we&#8217;re talking about an advantage of at most a few months, which is minor in the context of, say, 3-year forecasts.</p><p>Besides, we tend to overestimate how much additional information companies have on the inside &#8212; whether in terms of capability or (especially) in terms of safety.
Insiders warned for a long time that <a href="https://www.vox.com/future-perfect/353933/openai-open-letter-safety-whistleblowers-right-to-warn">&#8220;if only you know what we know...&#8221;</a> but when whistleblowers finally came forward, it turned out that they were mostly relying on the same kind of speculation that everyone else does.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>Another potential reason to give more weight to insiders is their technical expertise. We don&#8217;t think this is a strong reason: there is just as much AI expertise in academia as in industry. More importantly, deep technical expertise isn&#8217;t that important to support the kind of crude trend extrapolation that goes into AI forecasts. Nor is technical expertise enough &#8212; <a href="https://www.aisnakeoil.com/i/146043714/trend-extrapolation-is-baseless-speculation">business and social factors</a> play at least as big a role in determining the course of AI. In the case of self-driving cars, one such factor is the extent to which societies tolerate public roads being used for experimentation. In the case of large AI models, we&#8217;ve argued before that the most important factor is whether scaling will make <a href="https://www.aisnakeoil.com/p/ai-scaling-myths">business sense</a>, not whether it is technically feasible. So not only do techies not have much of an advantage, their tendency to overemphasize the technical dimensions results in overconfident predictions.</p><p>In short, the reasons why one might give more weight to insiders&#8217; views aren&#8217;t very important. On the other hand, there&#8217;s a huge and obvious reason why we should probably give less weight to their views, which is that they have an incentive to say things that are in their commercial interests, and they have a track record of doing so.</p><p>As an example, Sutskever had an <a href="https://garrisonlovely.substack.com/p/is-deep-learning-actually-hitting">incentive</a> to talk up scaling when he was at OpenAI and the company needed to raise money. But now that he heads the startup Safe Superintelligence, he needs to convince investors that it can compete with OpenAI, Anthropic, Google, and others, despite having access to much less capital. Perhaps that is why he is now talking about <a href="https://youtu.be/1yvBqasHLZs?feature=shared&amp;t=523">running out of data for pre-training</a>, as if it were some epiphany and not an endlessly repeated point.</p><p>To reiterate, we don&#8217;t know if model scaling has ended or not. But the industry&#8217;s sudden about-face has been so brazen that it should leave no doubt that insiders don&#8217;t have any kind of crystal ball and are making much the same guesses as everyone else, and are further biased by being in a bubble and readily consuming the hype they sell to the world.</p><p>In light of this, our suggestion &#8212; to everyone, but especially journalists, policymakers, and the AI community &#8212; is to end the deference to insiders&#8217; views when they predict the future of technology, especially its societal impacts.
This will take effort, as there is a pervasive unconscious bias in the U.S., in the form of a &#8220;distinctly American disease that seems to equate extreme wealth, and the power that comes with it, with virtue and intelligence.&#8221; (from Bryan Gardiner&#8217;s <a href="https://www.technologyreview.com/2024/12/13/1108459/book-review-silicon-valley-democracy-techlash-rob-lalka-venture-alchemists-marietje-schaake-tech-coup/amp/">review</a> of Marietje Schaake&#8217;s <a href="https://press.princeton.edu/books/hardcover/9780691241173/the-tech-coup?srsltid=AfmBOoo6e3ReisMn_Dsr5lNZiSe16xzdsJhYgcsU_70n946FL8zxXOlv">The Tech Coup</a>.)</p><h2><strong>Will progress in capabilities continue through inference scaling?</strong></h2><p>Of course, model scaling is not the only way to improve AI capabilities. <a href="https://arxiv.org/abs/2407.21787">Inference scaling</a> is an area with a lot of recent progress. For example, <a href="https://openai.com/o1/">OpenAI&#8217;s o1</a> and the open-weights competitor <a href="https://api-docs.deepseek.com/news/news1120">DeepSeek R1</a> are <a href="https://arxiv.org/pdf/2409.13373v1#page=1.31">reasoning models</a>: they have been fine-tuned to &#8220;reason&#8221; before providing an answer. Other methods leave the <a href="https://arxiv.org/pdf/2411.17501">model itself unchanged</a> but employ tricks like generating many solutions and ranking them by quality.</p><p>There are two main open questions about inference scaling that will determine how significant a trend it will be.</p><ol><li><p>What class of problems does it work well on?</p></li><li><p>For problems where it does work well, how much of an improvement is possible by doing more computation during inference?</p></li></ol><p>The per-token output cost of language models has been rapidly decreasing due to both hardware and algorithmic improvements, so if inference scaling yields improvements over many orders of magnitude &#8212; for example, if generating a million tokens on a given task yields significantly better performance than generating a hundred thousand tokens &#8212; that would be a big deal.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>The straightforward, intuitive answer to the first question is that inference scaling is useful for problems that have clear correct answers, such as coding or mathematical problem solving. In such tasks, at least one of two related things tends to be true. First, symbolic reasoning can improve accuracy. This is something LLMs are bad at due to their statistical nature, but one they can overcome by using output tokens for reasoning, much like a person using pen and paper to work through a math problem.
Second, it is easier to <a href="https://arxiv.org/pdf/2402.01817">verify</a> correct solutions than to generate them (sometimes aided by external verifiers, such as unit tests for coding or <a href="https://blog.eleuther.ai/llemma/">proof checkers</a> for mathematical theorem proving).</p><p>In contrast, for tasks such as writing or language translation, it is hard to see how inference scaling can make a big difference, especially if the limitations are due to the training data. For example, if a model works poorly in translating to a low-resource language because it isn&#8217;t aware of idiomatic phrases in that language, the model can&#8217;t reason its way out of this.</p><p>The early evidence we have so far, while spotty, is consistent with this intuition. Focusing on OpenAI o1, it <a href="https://openai.com/index/learning-to-reason-with-llms/">improves</a> compared to state-of-the-art language models such as GPT-4o on coding, math, <a href="https://arxiv.org/pdf/2410.21939#page=6.42">cybersecurity</a>, <a href="https://arxiv.org/pdf/2409.13373">planning in toy worlds</a>, and various <a href="https://openai.com/index/learning-to-reason-with-llms/">exams</a>. Improvements in exam performance seem to strongly correlate with the importance of reasoning for answering questions, as opposed to knowledge or creativity: big improvements for math, physics and LSATs, smaller improvements for subjects like biology and econometrics, and negligible improvement for English.</p><p>Tasks where o1 doesn&#8217;t seem to lead to an improvement include <a href="https://openai.com/index/learning-to-reason-with-llms/">writing</a>, certain <a href="https://cybench.github.io/">cybersecurity</a> tasks (which we explain below), <a href="https://arxiv.org/pdf/2212.08061">avoiding toxicity</a>, and an interesting set of <a href="https://arxiv.org/pdf/2410.21333">tasks</a> at which thinking is known to make humans <em>worse</em>.</p><p>We have created a <a href="https://benediktstroebl.github.io/reasoning-model-evals/">webpage</a> compiling the available evidence on how reasoning models compare against language models. We plan to keep it updated for the time being, though we expect that the torrent of findings will soon become difficult to keep up with.</p><p>Now let&#8217;s consider the second question: how large of an improvement can we get through inference scaling, assuming we had an infinite inference compute budget.</p><p>OpenAI&#8217;s flagship example to show off o1&#8217;s capabilities was AIME, a math benchmark. Their graph leaves this question tantalizingly open. Is the performance about to saturate, or can it be pushed close to 100%? 
Also note that the graph conveniently leaves out x-axis labels.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!0fDW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc348909d-140e-4e20-94d7-92a04e008bb2_1450x1270.png" width="1450" height="1270" alt=""><figcaption class="image-caption"><a href="https://openai.com/index/learning-to-reason-with-llms/">Source: OpenAI</a></figcaption></figure></div><p>An <a href="https://github.com/hughbzhang/o1_inference_scaling_laws">attempt</a> by external researchers to reconstruct this graph shows that (1) the cutoff for the x-axis is probably around 2,000 tokens, and (2) when o1 is asked to think longer than this, it doesn&#8217;t do so. So the question remains unanswered, and we need to wait for experiments using open-source models to get more clarity. It is great to see that there are vigorous efforts to <a href="https://arxiv.org/abs/2410.18982">publicly reproduce</a> the techniques behind o1.</p><p>In a recent paper called <em><a href="https://arxiv.org/abs/2411.17501">Inference Scaling fLaws</a></em> (the title is a pun on inference scaling laws), we look at a different approach to inference scaling &#8212; repeatedly generating solutions until one of them is judged as correct by an external verifier. While this approach has been associated with hopes of usefully scaling inference compute by many orders of magnitude (including by us <a href="https://arxiv.org/pdf/2407.01502#page=2.79">in our own past work</a>), we find that it is extremely sensitive to the quality of the verifier. If the verifier is slightly imperfect, in many realistic settings of a coding task, performance maxes out and actually starts to <em>decrease</em> after about 10 attempts; a toy simulation below illustrates the effect.</p><p>Generally speaking, the evidence for inference scaling &#8220;laws&#8221; is not convincing, and it remains to be seen if there are real-world problems where generating (say) millions of tokens at inference time will actually help.</p><h2><strong>Is inference scaling the next frontier?</strong></h2><p>There is a lot of low-hanging fruit for inference scaling, and progress in the short term is likely to be rapid. Notably, one current limitation of reasoning models is that they don&#8217;t work well in agentic systems.
We have observed this weakness in our own benchmark <a href="https://agent-evals-core-leaderboard.hf.space/">CORE-Bench</a>, which asks agents to reproduce the code provided with research papers &#8212; the best-performing agent scores 38% with Claude 3.5 Sonnet compared to only 24% with o1-mini.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> This also explains why reasoning models led to an improvement in one cybersecurity eval but not another &#8212; one of them involved agents.</p><p>We think there are two reasons why agents don&#8217;t seem to benefit from reasoning models. First, such models require different prompting styles than regular models, and current agentic systems are optimized for prompting regular models. Second, as far as we know, reasoning models so far have <em>not</em> been trained using reinforcement learning in a setting where they receive feedback from the environment &#8212; be it code execution, shell interaction, or web search. In other words, their tool-use ability is no better than that of the underlying model before learning to reason.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>These seem like relatively straightforward problems. Solving them might enable significant new AI agent capabilities &#8212; for example, generating complex, fully functional apps from a prompt. (There are already tools that try to do this, but they don&#8217;t work well.)</p><p>But what about the long run? Will inference scaling lead to the same kind of progress we&#8217;ve seen with model scaling over the last 7 years? Model scaling was so exciting because you &#8220;merely&#8221; needed to make data, model size, and compute bigger; no algorithmic breakthroughs were needed.</p><p>That&#8217;s not true (so far) with inference scaling: there&#8217;s a long list of inference scaling techniques; what works or doesn&#8217;t is problem-dependent; and even collectively, they only work in a circumscribed set of domains. AI developers are trying to overcome this limitation. For example, OpenAI&#8217;s reinforcement fine-tuning service is thought to be a way for the company to <a href="https://openai.com/form/rft-research-program/">collect customer data</a> from many different domains for fine-tuning a future model.</p><p>About a decade ago, reinforcement learning (RL) led to breakthroughs in many games like Atari. There was a lot of hype, and many AI researchers hoped we could RL our way to AGI. In fact, it was the high expectations around RL that led to the birth of explicitly AGI-focused labs, notably OpenAI. But those techniques didn&#8217;t generalize beyond narrow domains like games. Now there is similar hype about RL again. It is obviously a very powerful technique, but so far we&#8217;re seeing limitations similar to the ones that led to the dissipation of the previous wave of hype.</p><p>It is impossible to predict whether progress in AI capabilities will slow down. In fact, forget prediction &#8212; reasonable people can have very different opinions on whether AI progress has already slowed down, because they can interpret the evidence very differently.
That&#8217;s because &#8220;capability&#8221; is a <a href="https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/#/12">construct</a> that&#8217;s highly sensitive to how you measure it.</p><p>What we can say with more confidence is that the <em>nature</em> of progress in capabilities will be different with inference scaling than with model scaling. In the last few years, newer models predictably brought capability improvements each year across a vast swath of domains. There was a feeling of pessimism among many AI researchers outside the big labs that there was little to do except sit around and wait for the next state-of-the-art LLM to be released.</p><p>With inference scaling, capability improvements will likely be uneven and less predictable, driven more by algorithmic advances than by investment in hardware infrastructure. Many ideas that were discarded during the reign of LLMs, such as those from the old planning literature, are now back in the mix, and the scene seems intellectually more vibrant than in the last few years.</p><h2><strong>Product development lags capability increase</strong></h2><p>The furious debate about whether there is a capability slowdown is ironic, because the link between capability increases and the real-world usefulness of AI is extremely weak. The development of AI-based <a href="https://www.ben-evans.com/benedictevans/2024/4/19/looking-for-ai-use-cases">applications</a> lags far behind the increase in AI capabilities, so even existing AI capabilities remain greatly underutilized. One reason is the <a href="https://www.aisnakeoil.com/i/147899150/reliability">capability-reliability gap</a> &#8212; even when a certain capability exists, it may not work reliably enough that you can take the human out of the loop and actually automate the task (imagine a food delivery app that only works 80% of the time). And the methods for improving reliability are often application-dependent and distinct from methods for improving capability. That said, reasoning models also seem to exhibit <a href="https://youtu.be/iBfQTnA2n2s?si=a-760cPz5ZghJc7w&amp;t=161">reliability improvements</a>, which is exciting.</p><p>Here are a couple of analogies that help illustrate why it might take a decade or more to build products that fully take advantage of even current AI capabilities. The technology behind the internet and the web mostly solidified in the <a href="https://x.com/jefrankle/status/1867614458152943990">mid-90s</a>. But it took 1-2 more decades to realize the potential of web apps. Or consider this thought-provoking <a href="https://willwhitney.com/computing-inside-ai.html">essay</a> that argues that we need to build GUIs for large language models, which would allow us to interact with them at far higher bandwidth than text allows. From this perspective, the current state of AI-based products is analogous to PCs before the GUI.</p><p>The lag in product development is compounded by the fact that AI companies have not paid nearly enough attention to <a href="https://www.aisnakeoil.com/p/ai-companies-are-pivoting-from-creating">product aspects</a>, believing that the general-purpose nature of AI somehow grants an exemption from the hard problems of software engineering.
Fortunately, this has started to <a href="https://www.aisnakeoil.com/p/ai-companies-are-pivoting-from-creating">change</a> recently.</p><p>Now that they are focusing on products, AI companies as well as their users are re-discovering that software development, especially the user experience side of it, is hard, and requires a broader set of skills than AI model development. A great example is the fact that there are now two different ways to run Python code with ChatGPT (which is one of the most important capabilities for power users) and there is an intricate set of undocumented rules to remember regarding the capabilities and limitations of each of them. <a href="https://simonwillison.net/2024/Dec/10/chatgpt-canvas/">Simon Willison</a> says:</p><blockquote><p>Do you find this all hopelessly confusing? I don&#8217;t blame you. I&#8217;m a professional web developer and a Python engineer of 20+ years and I can just about understand and internalize the above set of rules.</p></blockquote><p>Still, this is a big improvement over a week ago, when these models had powerful coding capabilities yet did not come with the ability to run code that could use the internet! And even now, o1 can neither access the internet nor run code. From the perspective of AI impacts, what matters far more than capability improvement at this point is actually building products that let people do useful things with existing capabilities.</p><p>Finally, while product development lags behind capability, the adoption of AI-based products <a href="https://x.com/random_walker/status/1850871740550758766">further lags</a> far behind product development, for various behavioral, organizational, and societal reasons. Those interested in AI&#8217;s impacts (whether positive or negative) should pay much more attention to these downstream aspects than current or predicted capabilities.</p><h2><strong>Concluding thoughts</strong></h2><p>Maybe model scaling is over; maybe not. But it won&#8217;t continue forever, and the end of model scaling brings a long list of positives: AI progress once again depends on new ideas and not just compute; big companies, startups, and academic researchers can all compete on a relatively even playing field; regulation based on <a href="https://www.aisnakeoil.com/p/ai-safety-is-not-a-model-property">arbitrary</a> training compute thresholds becomes even <a href="https://arxiv.org/abs/2407.05694">harder to defend</a>; and there is a clear recognition that models themselves are just a technology, not a product.</p><p>As for the future of AI, it is clear that tech insiders are trying to figure it out just like the rest of us, and it is past time to stop trusting their overconfident, self-serving, shifting, and conveniently vague predictions.
And when we move beyond technical predictions to claims about AI&#8217;s impact on the world, there&#8217;s even less reason to trust industry leaders.</p><p><strong>Acknowledgment. </strong>We are grateful to Zachary S. Siegel for feedback on a draft.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>While OpenAI is known to have <a href="https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html">crawled</a> YouTube in the past, that was a small sliver of YouTube; it won&#8217;t be possible to crawl all of YouTube without Google noticing.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>A nice <a href="https://epoch.ai/blog/can-ai-scaling-continue-through-2030">analysis</a> by Epoch AI showed that scaling could continue until 2030. But this was published too recently (August 2024) to have been the anchor for the scaling narrative.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>We are referring to <em>substantive</em> knowledge about the safety of AI models and systems; whistleblowers did bring forth new knowledge about safety-related <em>processes</em> at OpenAI.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>That said, we can&#8217;t take future cost decreases for granted; we are also running into fundamental limits of inference cost-saving techniques like quantization.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>We set a cost limit of $4 for all models. On a small sample, with a $10 cost limit, o1-preview performed very poorly (10% accuracy). Given cost constraints, we did not evaluate the model with a higher cost limit on the entire dataset.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>o1 doesn&#8217;t even have access to tools during inference in the ChatGPT interface! Gemini Flash 2.0 does, but it is not clear if this is a model that has been fine-tuned for reasoning, let alone fine-tuned for tool use.</p></div></div>]]></content:encoded></item><item><title><![CDATA[We Looked at 78 Election Deepfakes.
Political Misinformation is not an AI Problem.]]></title><description><![CDATA[Technology Isn&#8217;t the Problem&#8212;or the Solution.]]></description><link>https://www.normaltech.ai/p/we-looked-at-78-election-deepfakes</link><guid isPermaLink="false">https://www.normaltech.ai/p/we-looked-at-78-election-deepfakes</guid><dc:creator><![CDATA[Sayash Kapoor]]></dc:creator><pubDate>Fri, 13 Dec 2024 20:51:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ezjw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29800e81-2362-430e-ada1-591a2d3a5228_910x1000.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI-generated misinformation was one of the top concerns during the 2024 U.S. presidential election. In January 2024, the World Economic Forum <a href="https://www.weforum.org/stories/2024/01/ai-disinformation-global-risks/">claimed</a> that &#8220;misinformation and disinformation is the most severe short-term risk the world faces&#8221; and that &#8220;AI is amplifying manipulated and distorted information that could destabilize societies.&#8221; News headlines about elections in 2024 tell a similar story:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ezjw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29800e81-2362-430e-ada1-591a2d3a5228_910x1000.jpeg" width="462" alt=""></figure></div><p>In contrast, in our past writing, we predicted that AI would not lead to a misinformation apocalypse. When Meta released its open-weight large language model (called LLaMA), we <a href="https://www.aisnakeoil.com/p/the-llama-is-out-of-the-bag-should">argued</a> that it would not lead to a tidal wave of misinformation. And in a follow-up essay, we <a href="https://knightcolumbia.org/content/how-to-prepare-for-the-deluge-of-generative-ai-on-social-media">pointed out</a> that the <em>distribution</em> of misinformation is the key bottleneck for influence operations, and while generative AI reduces the cost of creating misinformation, it does not reduce the cost of distributing it. A few other researchers have made <a href="https://misinforeview.hks.harvard.edu/article/misinformation-reloaded-fears-about-the-impact-of-generative-ai-on-misinformation-are-overblown/">similar arguments</a>.</p><p>Which of these two perspectives better fits the facts?</p><p>Fortunately, we have the evidence of AI use in elections that took place around the globe in 2024 to help answer this question. Many news outlets and research projects have compiled known instances of AI-generated text and media and their impact.
Instead of speculating about AI&#8217;s potential, we can look at its real-world impact to date.</p><p>We analyzed every instance of AI use in elections collected by the <a href="https://www.wired.com/story/generative-ai-global-elections/">WIRED AI Elections Project</a>, which tracked known uses of AI for creating political content during elections taking place in 2024 worldwide. In each case, we identified what AI was used for and estimated the cost of creating similar content without AI.</p><p>We find that (1) half of AI use isn&#8217;t deceptive, (2) deceptive content produced using AI is nevertheless cheap to replicate <em>without</em> AI, and (3) focusing on the demand for misinformation rather than the supply is a much more effective way to diagnose problems and identify interventions.</p><p>To be clear, AI-generated synthetic content poses many real dangers: the creation of <a href="https://www.pbs.org/newshour/show/nonconsensual-sexual-images-posted-online-made-worse-by-deepfakes-and-ai-technology">non-consensual images of people</a> and <a href="https://www.washingtonpost.com/technology/2024/04/22/ai-csam-ncmec-cybertipline-stanford-report/">child sexual abuse material</a>, and the enabling of the <a href="https://www.brookings.edu/articles/misunderstood-mechanics-how-ai-tiktok-and-the-liars-dividend-might-affect-the-2024-elections/">liar&#8217;s dividend</a>, which allows those in power to brush away real but embarrassing or controversial media content about them as AI-generated. These are all important challenges. This essay is focused on a different problem: political misinformation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Improving the information environment is a difficult and ongoing challenge. It&#8217;s understandable why people might think AI is making the problem worse: AI does make it possible to fabricate false content. But that has not fundamentally changed the landscape of political misinformation.</p><p>Paradoxically, the alarm about AI might be comforting because it positions concerns about the information environment as a discrete problem with a discrete solution. But fixes to the information environment depend on structural and institutional changes rather than on curbing AI-generated content.</p><h3><strong>Half of the Deepfakes in 2024 Elections Weren&#8217;t Deceptive</strong></h3><p>We analyzed all 78 instances of AI use in the WIRED AI Elections Project (<a href="https://www.cs.princeton.edu/~sayashk/political-misinformation/WIRED-data.html">source</a> for our analysis).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> We categorized each instance based on whether there was deceptive intent. For example, if AI was used to generate false media depicting a political candidate <a href="https://www.npr.org/2024/01/22/1226129926/nh-primary-biden-ai-robocall">saying something they didn&#8217;t</a>, we classified it as deceptive.
On the other hand, if a chatbot gave an <a href="https://www.proofnews.org/ai-models-falter-answering-election-questions-in-spanish/">incorrect response</a> to a genuine user query, a deepfake was created for <a href="https://restofworld.org/2024/exporter-india-deepfake-trolls/">parody or satire</a>, or a candidate transparently used AI to improve their campaigning materials (such as by <a href="https://www.hindustantimes.com/technology/pm-modi-uses-ai-tool-bhashini-at-kashi-tamil-sangamam-in-varanasi-what-is-it-101702837254702.html">translating</a> a speech into a language they don&#8217;t speak), we classified it as non-deceptive.</p><p>To our surprise, there was no deceptive intent in 39 of the 78 cases in the database.</p><p>The most common non-deceptive use of AI was for campaigning. When candidates or supporters used AI for campaigning, in most cases (19 out of 22), the apparent intent was to improve campaigning materials rather than to mislead voters with false information.</p><p>We even found examples of deepfakes that we think helped improve the information environment. In Venezuela, journalists used AI avatars to <a href="https://globalvoices.org/2024/08/19/venezuelans-use-ai-avatars-and-instagram-live-to-fight-back-maduros-repression/">avoid</a> government retribution when covering news adversarial to the government. In the U.S., a local news organization from Arizona, Arizona Agenda, used deepfakes to <a href="https://www.washingtonpost.com/politics/2024/03/24/kari-lake-deepfake/">educate</a> viewers about how easy it is to manipulate videos. In California, a candidate with laryngitis lost his voice, so he transparently used AI <a href="https://www.youtube.com/watch?v=LSlXWjMKL5E">voice cloning</a> to read out typed messages in his voice during meet-and-greets with voters.</p><p>Reasonable people can disagree on whether using AI in campaigning materials is legitimate or what the appropriate guardrails need to be. But using AI for campaign materials in non-deceptive ways (for example, when AI is used as a tool to improve voter outreach) is much less problematic than deploying AI-generated fake news to sway voters.</p><p>Of course, not all non-deceptive AI-generated political content is benign.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> Chatbots often <a href="https://www.proofnews.org/seeking-election-information-dont-trust-ai/">incorrectly</a> answer election-related questions. Rather than deceptive intent, this results from the limitations of chatbots, such as hallucinations and lack of factuality. Unfortunately, these limitations are not made clear to users, leading to an overreliance on flawed large language models (LLMs).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><h3><strong>Making Deceptive Political Misinformation Does Not Require AI</strong></h3><p>For each of the 39 examples of deceptive intent, where AI use was intended to make viewers believe outright false information, we estimated the cost of creating similar content <em>without</em> AI&#8212;for example, by hiring Photoshop experts, video editors, or voice actors. In each case, the cost of creating similar content without AI was modest&#8212;no more than a few hundred dollars.
(We even found that a video involving a <a href="https://www.latimes.com/california/story/2024-09-24/fake-russian-news-site-falsely-claimed-kamala-harris-was-in-hit-and-run-accident">hired stage actor</a> was incorrectly marked as being AI-generated in WIRED&#8217;s election database.)</p><p>In fact, it has long been possible to create media with outright false information without using AI or other fancy tools. One video used <a href="https://www.latimes.com/california/story/2024-09-24/fake-russian-news-site-falsely-claimed-kamala-harris-was-in-hit-and-run-accident">stage actors</a> to falsely claim that U.S. Vice President and Democratic presidential candidate Kamala Harris was involved in a hit-and-run incident. Another <a href="https://www.usatoday.com/story/news/factcheck/2024/09/30/kamala-harris-slur-speech-helene-fact-check/75447979007/">slowed down</a> the vice president&#8217;s speech to make it sound like she was slurring her words. An edited video of Indian opposition candidate Rahul Gandhi showed him <a href="https://archive.is/SxXeX#selection-1109.0-1126.0">saying</a> that the incumbent Narendra Modi would win the election. In the original video, Gandhi said his opponent would <em>not</em> win the election, but it was <a href="https://www.boomlive.in/fact-check/factcheck-rahul-gandhi-modi-will-become-pm-again-doctored-video-fake-news-25264">edited</a> using jump cuts to take out the word &#8220;not.&#8221; Such media content has been called &#8220;<a href="https://datasociety.net/wp-content/uploads/2019/09/DataSociety_Deepfakes_Cheap_Fakes.pdf">cheap fakes</a>&#8221; (as opposed to AI-generated &#8220;deepfakes&#8221;).</p><p>There were many instances of cheap fakes used in the 2024 U.S. election. The News Literacy Project <a href="https://misinfodashboard.newslit.org/">documented</a> known misinformation about the election and <a href="https://newslit.org/newsroom/press-release/the-news-literacy-project-experts-watch-out-for-ai-generated-fakes-and-disinformation-about-voting-ahead-of-election-day/">found</a> that cheap fakes were used seven times more often than AI-generated content. Similarly, in other countries, cheap fakes were quite prevalent. An <a href="https://www.boomlive.in/">India-based fact checker</a> reviewed an order of magnitude more cheap fakes and traditionally edited media than <a href="https://www.boomlive.in/deepfake-tracker">deepfakes</a>. In Bangladesh, cheap fakes were over <a href="https://www.context.news/ai/opinion/cheap-fakes-are-a-blind-spot-for-platforms-in-the-global-south">20 times more prevalent</a> than deepfakes.</p><p>Let&#8217;s consider two examples to analyze how cheap fakes could have led to substantially similar effects to the deepfakes that got a lot of media attention: Donald Trump&#8217;s use of Taylor Swift deepfakes to campaign and a voice-cloned robocall that imitated U.S.
President Joe Biden in the New Hampshire primary, asking voters not to vote.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!64fX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab44993-d832-4881-9532-f78e94883231_1222x1466.jpeg" width="444" alt=""><figcaption class="image-caption"><em>A Truth Social post shared by Donald Trump with images of Taylor Swift fans wearing &#8220;Swifties for Trump&#8221; t-shirts. Top left: A post with many AI-generated images of women wearing &#8220;Swifties for Trump&#8221; t-shirts, with a &#8220;satire&#8221; label. Top right: A <a href="https://www.cbsnews.com/news/trump-shares-fake-swifties-for-trump-images/">real image</a> of Trump supporter Jenna Piwowarczyk wearing a &#8220;Swifties for Trump&#8221; t-shirt. Bottom left: A fabricated image of Taylor Swift in front of the American flag with the caption, &#8220;Taylor wants you to vote for Donald Trump.&#8221; It is unclear if the image was created using AI or other editing software. Bottom right: A Twitter post with two images: one AI-generated, the other real, of women wearing &#8220;Swifties for Trump&#8221; t-shirts.</em></figcaption></figure></div><p>Trump&#8217;s use of Swift deepfakes implied that Taylor Swift had endorsed him and that Swift fans were attending his rallies en masse. In the wake of the post, many <a href="https://www.forbes.com/sites/steveandriole/2024/08/21/taylor-swift-donald-trump-and-ai/">media</a> <a href="https://www.lowyinstitute.org/the-interpreter/what-taylor-swift-taught-world-risks-ai-generated-images-elections">outlets</a> <a href="https://www.latimes.com/entertainment-arts/business/story/2024-08-21/deep-fakes-social-media-taylor-swift-donald-trump">blamed</a> AI for the spread of misinformation.</p><p>But recreating similar images without AI is easy. Images depicting Swift&#8217;s support could be created by photoshopping text endorsing Trump onto any of her existing images. Likewise, getting images of Trump supporters wearing &#8220;Swifties for Trump&#8221; t-shirts could be achieved by distributing free t-shirts at a rally&#8212;or even selectively reaching out to Swift fans at Trump rallies.
In fact, two of the images Trump shared were real images of a Trump <a href="https://www.cbsnews.com/news/trump-shares-fake-swifties-for-trump-images/">supporter who is also a Swift fan</a>.</p><p>Another incident that led to a brief panic was an AI clone of President Joe Biden&#8217;s voice that <a href="https://www.npr.org/2024/01/22/1226129926/nh-primary-biden-ai-robocall">asked people not to vote</a> in the New Hampshire primary.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/9a8a7210-696d-457a-8997-8be9775e5d92_1022x1296.jpeg" alt=""><figcaption class="image-caption"><em>News headlines in the wake of the Biden robocall.</em></figcaption></figure></div><p>Rules against such robocalls have existed for years. In fact, the perpetrator of this particular robocall was <a href="https://www.reuters.com/world/us/fcc-finalizes-6-million-fine-over-ai-generated-biden-robocalls-2024-09-26/">fined $6 million</a> by the Federal Communications Commission (FCC). The FCC has tiplines for reporting similar attacks, and it <a href="https://fccprod.servicenowservices.com/rmd?id=rmd_listings">enforces rules around robocalls frequently</a>, regardless of whether AI is used. Since the robocall used a static recording, it could have been made about as easily without AI&#8212;for instance, by hiring a voice impersonator.</p><p>It is also unclear what impact the robocall had: its efficacy depends on the recipient believing that the <em>president of the United States</em> is personally calling them on the phone to ask them <em>not</em> to vote in a primary.</p><p>Is it just a matter of time until improvements in technology and the expertise of actors seeking to influence elections lead to more effective AI disinformation? We don&#8217;t think so. In the next section, we argue that the structural factors driving the demand for misinformation are unaffected by AI. We then look at the history of predictions about coming waves of AI disinformation that have accompanied the release of new tools&#8212;predictions that have not come to pass.</p><h3><strong>The Demand for Misinformation</strong></h3><p>Misinformation can be seen through the forces of <a href="https://reason.com/volokh/2023/10/05/the-demand-for-political-misinformation-is-a-bigger-problem-than-the-supply-even-in-the-age-of-ai/">supply and demand</a>. 
The supply comes from people who want to make a buck by generating clicks, partisans who want their side to win, or state actors who want to conduct influence operations. Interventions so far have almost entirely tried to curb the supply of misinformation while leaving the demand unchanged.</p><p>The focus on AI is the latest example of this trend. Since AI reduces the cost of generating misinformation to nearly zero, analysts who look at misinformation as a supply problem are very concerned. But analyzing the <em>demand</em> for misinformation can clarify how misinformation spreads and what interventions are likely to help.</p><p>Looking at the demand for misinformation tells us that as long as people have certain worldviews, they will seek out and find <a href="https://dl.acm.org/doi/10.1145/2470654.2481326">information consistent with those views</a>. Depending on what someone&#8217;s worldview is, the information in question is often misinformation&#8212;or at least would be considered misinformation by those with differing worldviews.</p><p>In other words, successful misinformation operations <a href="https://dl.acm.org/doi/10.1145/3091478.3091523">target in-group members</a>&#8212;people who <a href="https://misinforeview.hks.harvard.edu/wp-content/uploads/2024/04/traberg_gamified_inoculation_political_ingroups_20240430.pdf">already</a> <a href="https://www.nber.org/papers/w28884">agree</a> with the broad intent of the message. Such recipients tend to be less skeptical of messages that conform to their worldviews and may even be willing to knowingly amplify <a href="https://ohiocapitaljournal.com/2024/09/16/voters-moral-flexibility-helps-them-defend-politicians-misinformation/">false information</a>. Sophisticated tools aren&#8217;t needed for misinformation to be effective in this context. On the flip side, it is extremely hard to convince <a href="https://www.science.org/doi/10.1126/sciadv.abf1234?url_ver=Z39.88-2003&amp;rfr_id=ori:rid:crossref.org&amp;rfr_dat=cr_pub%20%200pubmed#sec-2">out-group members</a> of false information that they <em>don't</em> agree with, regardless of whether AI is used.</p><p>Seen in this light, AI misinformation plays a very different role from its popular depiction of swaying voters in elections. 
Increasing the supply of misinformation does not meaningfully change the dynamics of the <em>demand</em> for misinformation since the increased supply is <a href="https://www.nature.com/articles/s41562-020-0833-x">competing for the same eyeballs.</a> Moreover, the increased supply of misinformation is <a href="https://cetas.turing.ac.uk/publications/ai-enabled-influence-operations-safeguarding-future-elections">likely to be consumed</a> mainly by a <a href="https://www.nature.com/articles/s41562-023-01564-2.pdf">small</a> <a href="https://www.science.org/doi/10.1126/science.aau2706">group</a> of partisans who <a href="https://www.nature.com/articles/s41598-022-19837-7.pdf">already agree with it</a> and heavily consume misinformation, rather than persuading a broader swath of the public.</p><p>This also explains why <a href="https://datasociety.net/wp-content/uploads/2019/09/DataSociety_Deepfakes_Cheap_Fakes.pdf">cheap fakes</a> such as media from unrelated events, traditional video edits such as jump cuts, or even <a href="https://www.wired.com/story/x-israel-hamas-war-disinformation/">video game footage</a> can be effective for propagating misinformation despite their low quality: It is much easier to convince someone of misinformation if they already agree with its message.</p><p>Our analysis of the demand for misinformation may be most applicable to countries with polarized close races where leading parties have similar capacities for voter outreach, so that voters&#8217; (mis)information demands are already saturated.</p><p>Still, to our knowledge, in every country that has held elections in 2024 so far, AI misinformation has had much <a href="https://www.context.news/ai/opinion/cheap-fakes-are-a-blind-spot-for-platforms-in-the-global-south">less impact than feared</a>. In India, deepfakes were used for <a href="https://restofworld.org/2024/exporter-india-deepfake-trolls/">trolling</a> more than for spreading false information. In Indonesia, the impact of AI wasn't to sow false information but rather to soften the image of then-candidate, now-President Prabowo Subianto (a former general accused of many past human rights abuses) using <a href="https://www.cigionline.org/articles/its-time-to-reframe-disinformation-indonesias-elections-show-why/">AI-generated digital cartoon avatars</a> that depicted him as likable.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><h3><strong>Why Do Concerns About AI Misinformation Keep Recurring?</strong></h3><p>The 2024 election cycle wasn&#8217;t the first time there was widespread fear that AI deepfakes would lead to rampant political misinformation. <a href="https://www.cnn.com/2019/06/12/tech/deepfake-2020-detection/index.html">Strikingly similar concerns</a> about AI were expressed before the 2020 U.S. election, though these concerns <a href="https://www.npr.org/2020/10/01/918223033/where-are-the-deepfakes-in-this-presidential-election">were not borne out</a>. The release of new AI tools is often accompanied by worries that they will unleash new waves of misinformation:</p><ul><li><p><strong>2019. </strong>When OpenAI released its GPT-2 series of models in 2019, one of the main reasons it <a href="https://www.technologyreview.com/2019/08/29/133218/openai-released-its-fake-news-ai-gpt-2/">held back</a> on releasing the model weights for the most capable models in the series was their alleged potential to generate misinformation.</p></li><li><p><strong>2023. 
</strong>When Meta released the LLaMA model openly in 2023, multiple news outlets reported <a href="https://www.nytimes.com/2023/05/18/technology/ai-meta-open-source.html">concerns</a> that it would trigger a deluge of AI misinformation. These models were far more powerful than the GPT-2 models released by OpenAI in 2019. Yet we have not seen evidence of large-scale voter persuasion attributed to the use of LLaMA or other large language models.</p></li><li><p><strong>2024. </strong>Most recently, the widespread availability of AI image editing tools on smartphones has prompted similar <a href="https://www.theverge.com/2024/8/26/24228808/ai-image-editing-photoshop-comparison-argument">concerns</a>.</p></li></ul><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/97a992ce-4187-46e1-b7cd-85a357716c64_779x855.png" alt=""><figcaption class="image-caption"><em><a href="https://newsletter.pessimistsarchive.org/p/the-1912-war-on-fake-photos">Source</a>, <a href="https://www.newspapers.com/article/janesville-weekly-gazette/153866353/">Primary source</a></em></figcaption></figure></div><p>In fact, concerns about using new technology to create false information go back over a century. The late 19th and early 20th centuries saw the advent of technologies for photo retouching. This was accompanied by concerns that retouched photographs would be used to deceive people, and, in 1912, a bill was introduced in the U.S. that would have criminalized <a href="https://newsletter.pessimistsarchive.org/p/the-1912-war-on-fake-photos">photo editing</a> without subjects&#8217; consent. (It died in the Senate.)</p><p>Thinking of political misinformation as a <a href="https://x.com/JMchangama/status/1857136505472405680">technological (or AI)</a> problem is <a href="https://www.technologyreview.com/2024/09/03/1103464/ai-impact-elections-overblown/">appealing</a> because it makes the solution seem tractable. If only we could roll back harmful tech, we could drastically improve the information environment!</p><p>While the goal of improving the information environment is laudable, blaming technology is not a fix. Political polarization has <a href="https://www.taylorfrancis.com/chapters/edit/10.4324/9780203713020-4/distrust-news-media-symptom-cause-partisan-polarization-jonathan-ladd-alexander-podkul">led</a> to greater <a href="https://shorensteincenter.org/future-trustworthy-information-learning-online-content-creators/">mistrust</a> of the media. 
People prefer sources that <a href="https://dl.acm.org/doi/10.1145/2470654.2481326">confirm</a> their worldview and are less <a href="https://academic.oup.com/restud/advance-article-abstract/doi/10.1093/restud/rdae058/7685990?redirectedFrom=fulltext">skeptical</a> of content that <a href="https://www.cambridge.org/core/services/aop-cambridge-core/content/view/1121D46E918815B2CD7A9C6C237464A2/S1930297500003570a.pdf/susceptibility-to-misinformation-is-consistent-across-questionframings-and-response-modes-and-better-explained-by-myside-bias-and-partisanshipthan-analytical-thinking.pdf#page=20.60">fits</a> it. Another major factor is the <a href="https://www.census.gov/library/stories/2022/06/internet-crushes-traditional-media.html">drastic decline</a> of <a href="https://www.pewresearch.org/journalism/fact-sheet/newspapers/">journalism revenues</a> in the last two decades&#8212;largely driven by the shift from traditional to social media and online advertising. But this is more a result of structural changes in how people seek out and consume information than the specific threat of misinformation shared online.</p><p>As history professor Sam Lebovic has pointed out, improving the information environment is <a href="https://knightcolumbia.org/content/fake-news-lies-and-other-familiar-problems">inextricably linked</a> to the larger project of shoring up democracy and its institutions. There&#8217;s no quick technical fix or targeted regulation that can &#8220;solve&#8221; our information problems. We should reject the simplistic temptation to blame AI for political misinformation and confront the gravity of the hard problem.</p><p><em>Correction: A previous version of the essay&#8217;s introduction stated that most AI use is not deceptive. In fact, 39 of 78 articles in the database are examples of non-deceptive AI use, or 39 out of 74 if we restrict ourselves to political communication and set aside the 4 instances that are scams.</em></p><p><em>This essay is <a href="https://knightcolumbia.org/blog/we-looked-at-78-election-deepfakes-political-misinformation-is-not-an-ai-problem">cross-posted</a> to the Knight First Amendment Institute website. We are grateful to <a href="https://knightcolumbia.org/bios/view/katherine-glenn-bass">Katy Glenn Bass</a> for her feedback.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The terms mis- and disinformation lack agreed-upon definitions. In this piece, we use the term misinformation to refer to outright false information, as opposed to issues of misleading interpretive framing. Despite many people&#8217;s perception of outgroup narratives as &#8220;misinformation,&#8221; we don't think the misinformation lens is a useful way to think about differences in framing and narratives; we're more narrowly concerned with the use of outright false information to support those narratives.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The low number of total deepfakes found in elections worldwide is surprising on its own terms. The small number could either indicate that AI deepfakes are a much smaller problem so far than anticipated or that the database has many missing entries. 
Still, other databases that tracked election deepfakes have a similar count for the total number of deepfakes; for example, the German Marshall Fund&#8217;s list of deepfakes related to 2024 elections worldwide has <a href="https://www.gmfus.org/spitting-images-tracking-deepfakes-and-generative-ai-elections">133 entries</a>, though it started collecting entries in September 2023. As we note further along in the essay, the News Literacy Project <a href="https://misinfodashboard.newslit.org/">documented</a> known misinformation about the 2024 elections and <a href="https://newslit.org/newsroom/press-release/the-news-literacy-project-experts-watch-out-for-ai-generated-fakes-and-disinformation-about-voting-ahead-of-election-day/">found</a> that cheap fakes that didn't use AI were used seven times more often than AI-generated content.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The dataset also included four instances of AI-generated deepfake videos of politicians used to perpetrate financial scams. Compared to political misinformation, scams have very different dynamics (more sophisticated videos could be more convincing) and stakes (they involve individual financial harm rather than threats to democracy). Addressing scams also requires different interventions&#8212;for instance, monitoring and removing networks of scammers is something major online platforms have been doing for a long time. In other words, scams are a different problem that we have other tools for addressing (even if some platforms arguably underinvest in doing so) and are outside the scope of this essay.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In the last legs of the 2024 U.S. election, Google and OpenAI restricted their chatbots from <a href="https://archive.is/EPIcW#selection-975.0-988.0">answering</a> election-related queries&#8212;though competitors like Perplexity didn't, claiming that their product was highly accurate. <a href="https://www.proofnews.org/seeking-election-information-dont-trust-ai/">Evaluating</a> chatbots&#8217; tendency to answer factually or abstain, improving the factuality of responses, and ensuring chatbots work across different <a href="https://www.proofnews.org/ai-models-falter-answering-election-questions-in-spanish/">languages</a> and contexts are important areas of work as more people turn to chatbots for answers.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>To be clear, we should not treat such propaganda as something newly made possible by AI. It is the incremental evolution of long-standing <a href="https://www.routledge.com/Propaganda-From-Disinformation-and-Influence-to-Operations-and-Information-Warfare/Olejnik/p/book/9781032813721">techniques</a>. Indeed, the cost of creating cartoon avatars for presidential campaigns would be minuscule with or without AI. 
The impact of propaganda depends not on the technical methods used to create it but rather on the freedom of the press to uplift competing narratives.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Does the UK’s liver transplant matching algorithm systematically exclude younger patients?]]></title><description><![CDATA[Seemingly minor technical decisions can have life-or-death effects]]></description><link>https://www.normaltech.ai/p/does-the-uks-liver-transplant-matching</link><guid isPermaLink="false">https://www.normaltech.ai/p/does-the-uks-liver-transplant-matching</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Mon, 11 Nov 2024 19:57:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/104a5670-c400-45bf-b02e-18d3bfe2e46d_2890x1648.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>By Arvind Narayanan, Angelina Wang, Sayash Kapoor, and Solon Barocas</em></p><p>Predictive algorithms are used in many life-or-death situations. In the paper <a href="https://www.aisnakeoil.com/p/ai-cannot-predict-the-future-but">Against Predictive Optimization</a>, we argued that the use of predictive logic for making decisions about people has recurring, inherent flaws, and should be rejected in many cases.</p><p>A wrenching case study comes from the UK&#8217;s liver allocation algorithm, which appears to <a href="https://www.ft.com/content/5125c83a-b82b-40c5-8b35-99579e087951">discriminate by age</a>, with some younger patients seemingly unable to receive a transplant, no matter how ill. What went wrong here? Can it be fixed? Or should health systems avoid using algorithms for liver transplant matching?</p><h3><strong>How the liver allocation algorithm works</strong></h3><p>The UK <a href="https://www.odt.nhs.uk/odt-structures-and-standards/odt-hub-programme/national-liver-offering-scheme/">nationalized</a> its liver transplant system in 2018, replacing previous regional systems where livers were prioritized based on disease severity.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> When a liver becomes available, the new algorithm uses predictive logic to calculate how much each patient on the national waiting list would benefit from being given that liver.&nbsp;</p><p>Specifically, the algorithm predicts how long each patient would live if they were given that liver, and how long they would live if they didn&#8217;t get a transplant. The difference between the two is the patient&#8217;s Transplant Benefit Score (TBS). Patients are sorted in decreasing order of the score, and the top patient is offered the liver (if they decline, the next patient is offered, and so on).</p><p>Given this description, one would expect that the algorithm would <em>favor</em> younger patients, as they will potentially gain many more decades of life through a transplant compared to older patients. If the algorithm has the opposite effect, either the score has been inaccurately <a href="https://nhsbtdbe.blob.core.windows.net/umbraco-assets-corp/11879/faqs-for-national-liver-offering-scheme-may2018.pdf">portrayed</a> or it is being calculated incorrectly. We&#8217;ll see which one it is. 
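</p><p>To make the intended logic concrete, here is a minimal sketch in Python of the kind of scoring-and-ranking step just described. Everything in it is illustrative: the stand-in survival predictor and the toy patient records are ours, not the NHS&#8217;s actual statistical models.</p><pre><code># A minimal sketch of the allocation logic described above.
# All names and numbers are illustrative stand-ins, not the NHS model.

def transplant_benefit(patient, predict_years):
    """Predicted years of life with the offered liver,
    minus predicted years of life without a transplant."""
    return (predict_years(patient, transplant=True)
            - predict_years(patient, transplant=False))

def offer_order(waiting_list, predict_years):
    """Rank the national waiting list by descending benefit score;
    the liver is offered to patients in this order."""
    return sorted(waiting_list,
                  key=lambda p: transplant_benefit(p, predict_years),
                  reverse=True)

# Toy stand-in for the survival models (the real ones are
# statistical models fit to registry data).
def toy_predict_years(patient, transplant):
    return patient["with_liver"] if transplant else patient["without_liver"]

waiting_list = [
    {"name": "A, age 30", "with_liver": 40, "without_liver": 5},
    {"name": "B, age 65", "with_liver": 12, "without_liver": 2},
]

# Under this idealized logic, the younger patient ranks first:
# benefit(A) = 35 life years, benefit(B) = 10 life years.
print([p["name"] for p in offer_order(waiting_list, toy_predict_years)])
</code></pre><p>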
But first, let&#8217;s discuss a more basic question.</p><h3><strong>Why is predictive AI even needed?</strong></h3><p>Discussions of the ethics of algorithmic decision making often narrowly focus on bias, ignoring the question of whether it is <a href="https://predictive-optimization.cs.princeton.edu/">legitimate</a> to use an algorithm in the first place. For example, consider pretrial risk prediction in the criminal justice system. While bias is a serious concern, a deeper question is whether it is morally justified to deny defendants their freedom based on a prediction of what they might do rather than a determination of guilt, especially when that prediction is barely more accurate than a coin flip.</p><p>Organ transplantation is different in many ways. The health system needs to make efficient and ethical use of a very limited and valuable resource, and must find some principled way of allocating it to many deserving people, all of whom have reasonable claims for why they should be entitled to it. There are thousands of potential recipients, and decisions must be made quickly when an organ becomes available. Human judgment doesn&#8217;t scale.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>Another way to try to avoid the need for predictive algorithms is to <a href="https://arxiv.org/pdf/2312.08511">increase the pool of organs</a> so that they are no longer as scarce. Encouraging people to sign up for organ donation is definitely important. But even if the supply of livers is no longer a constraint, it would still be useful to predict which patient will benefit the most from a <em>specific</em> liver.</p><p>Sometimes <a href="https://5harad.com/papers/simple-rules.pdf">simple statistical formulas</a> provide most of the benefits of predictive AI without the downsides. In fact, the previous liver transplant system in the UK was based on a relatively simple formula for predicting disease severity, called the UK End-stage Liver Disease score, which is based on the blood levels of a few markers. The new system takes into account the benefit of transplantation in addition to disease severity. It is also more of a black box. It is &#8220;AI&#8221; in the sense that it is derived from a data-driven optimization process and is too complex to be intuitively understood by doctors or patients. It uses 28 variables from the donor and recipient to make a prediction.</p><p>It seems at least plausible that this complexity is justified in this context because health outcomes are much more predictable than criminal behavior (though this varies by disease). Follow-up studies have <a href="https://www.journal-of-hepatology.eu/article/S0168-8278(24)00203-4/fulltext">confirmed</a> that the matching algorithm does indeed save more lives than the system that it replaced.</p><p>So there isn&#8217;t necessarily a prima facie case for arguing against the use of the algorithm. Instead, we have to look at the details of what went wrong. Let&#8217;s turn to those now.</p><h3><strong>The Financial Times investigation</strong></h3><p>In November 2023, the Financial Times published a bombshell <a href="https://www.ft.com/content/5125c83a-b82b-40c5-8b35-99579e087951">investigation</a> about bias in the algorithm. It centers on a 31-year-old patient, Sarah Meredith, with multiple genetic conditions including cystic fibrosis. 
It describes her accidental discovery that the Transplant Benefit Score algorithm even existed and would decide her fate; her struggle to understand how it worked; her liver doctors&#8217; lack of even basic knowledge about the algorithm; and her realization that there was no physician override to the TBS score and no appeals process.</p><p>When she reached out to the National Health Service to ask for explanations, Meredith was repeatedly told she wouldn&#8217;t understand. It seems that the paternalism of health systems combined with the <a href="https://royalsocietypublishing.org/doi/10.1098/rsta.2018.0084">myth</a> of the inscrutability of algorithms is a particularly toxic mix.</p><p>Meredith eventually landed on a <a href="https://transplantbenefit.org/">web app</a> that calculates the TBS, built by Professor Ewen Harrison and his team. He is a surgeon and data scientist who has studied the TBS, and a co-author of a study of some of the algorithm&#8217;s failures. It is through this app that Meredith realized how biased the algorithm is. It also shows why the inscrutability of algorithmic decision making is a myth: even without understanding the internals, it is easy to understand the behavior of the system, especially given that a particular patient only cares about how the system behaves in one specific instance.</p><p>But this isn&#8217;t just one patient&#8217;s experience. From the Financial Times piece:</p><blockquote><p>&#8220;If you&#8217;re below 45 years, no matter how ill, it is impossible for you to score high enough to be given priority scores on the list,&#8221; said Palak Trivedi, a consultant hepatologist at the University of Birmingham, which has one of the country&#8217;s largest liver transplant centres.</p></blockquote><p>Finally, a 2024 <a href="https://www.thelancet.com/journals/lanhl/article/PIIS2666-7568(24)00044-8/fulltext">study</a> in The Lancet has confirmed that the algorithm has a severe bias against younger patients.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><h3><strong>Patient groups warned about the bias</strong></h3><p>The objective of the matching system is to identify the recipient whose life expectancy would be increased the most through the transplant. The obvious way to do this is to predict each patient&#8217;s expected survival time with and without the transplant. This is almost what the algorithm does, but not quite &#8212; it predicts each patient&#8217;s <em>likelihood of surviving 5 years</em> with and without the transplant.</p><p>The problem with this is obvious. A <a href="https://pscsupport.org.uk/what-we-do/lptc/">patient group</a> gave this feedback through official channels in 2015, long before the algorithm went into effect:</p><blockquote><p>Capping survival at five years in effect diminishes the benefits for younger patients as it underestimates the gain in life years by predicting lifetime gain over 5 years, as opposed to the total lifetime gain. Paediatric and small adult patients benefit from accessing small adult livers as a national priority in the Current System. However, young adults must compete directly with all other adult patients. In the proposed model, there is no recognition that a death in a younger patient is associated with a greater number of expected years of life lost compared with the death of an older adult patient. 
There is also no recognition that longer periods waiting has an impact on younger patients&#8217; prospects, such as career and family, and contribution to society compared with older adult patients. Younger patients have not yet had the chance to live their lives and consideration should be given to how the cohort of younger waiting list patients is affected by rules applied to calculate their benefit.</p></blockquote><p>This is what leads to the algorithm&#8217;s behavior. Younger patients are (correctly) predicted to be more likely to survive 5 years <em>without</em> a transplant, and about as likely as older patients to survive 5 years <em>with</em> a transplant. So younger patients&#8217; predicted net benefit (over a 5-year period) is much less than older patients&#8217;. Over the entire course of their lives, younger patients would likely benefit more, but the algorithm doesn&#8217;t take this into account.</p><h3><strong>Show us the target variable and we&#8217;ll show you the problem</strong></h3><p>It is not clear why the problem was ignored, both in version 1 of the algorithm in 2018 and in version 2 in 2022, which corrected a bias against cancer patients (we&#8217;ll get to that bias in a minute). Perhaps the developers did not recognize how severe the age bias is. Even in a <a href="https://www.journal-of-hepatology.eu/article/S0168-8278(24)00203-4/fulltext">2024 paper</a> about the algorithm, where they briefly discuss many of its limitations <em>including the five-year cap</em>, they do not mention that the cap de-prioritizes younger patients.</p><p>On the other hand, the list of features (donor and recipient characteristics) is prominently presented and discussed in <a href="https://www.odt.nhs.uk/odt-structures-and-standards/odt-hub-programme/national-liver-offering-scheme/">public communications</a> about the system. This may reflect a misconception that the way to understand an algorithm, including its potentially discriminatory effects, is to look at the list of features &#8212; the <em>inputs</em>. In reality, the target variable &#8212; the <em>output</em> &#8212; is often more important for fairness than the features.</p><p>Unfortunately, there is little recognition of this crucial fact outside the technical community (and sometimes even within the technical community). Instead there is a narrow focus on removing sensitive variables (such as age, race, or gender) and proxies for the sensitive variables from the list of features, which is <a href="https://fairmlbook.org/classification.html#no-fairness-through-unawareness">usually ineffective</a> and often even <a href="https://www.nejm.org/doi/full/10.1056/NEJMp2311050">counterproductive</a>.</p><p>The choice of a 5-year period seems to be due to <a href="https://www.journal-of-hepatology.eu/article/S0168-8278(24)00203-4/fulltext">data availability</a>: &#8220;This length of follow-up was selected as data were readily available ... while longer follow-up was not.&#8221; In our experience, there is almost always some difficulty that prevents accurately measuring the true construct of interest, which is why this is one of the recurring flaws we identify in the <a href="https://predictive-optimization.cs.princeton.edu/">Against Predictive Optimization</a> paper. 
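</p><p>A toy calculation makes the consequence of the capped target concrete. The numbers below are ours and purely illustrative; the real models predict five-year survival from donor and recipient characteristics.</p><pre><code># Toy illustration of the 5-year cap; illustrative numbers only.
# The capped target counts at most 5 years of survival on each
# side of the comparison.

def lifetime_benefit(years_with, years_without):
    return years_with - years_without

def capped_benefit(years_with, years_without, cap=5):
    return min(years_with, cap) - min(years_without, cap)

# A young patient: very ill, but still likely to survive a few
# years even without a transplant.
print(lifetime_benefit(40, 4.5))  # 35.5 life years gained
print(capped_benefit(40, 4.5))    # 0.5, near the bottom of the list

# An older patient: close to death without a transplant.
print(lifetime_benefit(12, 1))    # 11 life years gained
print(capped_benefit(12, 1))      # 4, near the top of the list
</code></pre><p>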
It is a <a href="https://arxiv.org/abs/1912.05511">target-construct mismatch</a>, because what is being predicted, the target, differs from what we actually want to predict, the construct.</p><h3><strong>It gets worse</strong></h3><p>The cap means that the expected survival with a transplant for most patient groups is about the same (about 4.5 years: roughly 85% of patients survive the full 5 years, which alone contributes 0.85 &#215; 5 = 4.25 years, and patients who die within 5 years contribute the rest). So the utility of the transplant, while high, is more or less uniformly high, which means that it doesn&#8217;t really factor into the scores! It turns out that the algorithm is mostly just assessing need, that is, how long patients would survive <em>without</em> a transplant.</p><p>This is ironic because modeling post-transplant survival was <a href="https://nhsbtdbe.blob.core.windows.net/umbraco-assets-corp/7892/national-liver-offering-roadshow-presentation.pdf">claimed</a> to be the main reason to use this system over the previous one. If it keeps more people from dying, we suspect it is simply because it does a better job of assessing need, and/or because the use of the algorithm coincided with a move from regional to national systems, allowing it to better cater to high-need patients in previously under-served regions.</p><p>The fact that the system isn&#8217;t very good at meeting its stated objectives only seems to have been <a href="https://pubmed.ncbi.nlm.nih.gov/37516542/">reported</a> a decade after the algorithm was developed (although in retrospect, there were clear signals in the results of the <a href="https://bts.org.uk/wp-content/uploads/2018/04/NHSBT-John-OGrady.pdf">simulations</a> that were run before deployment). Specifically, it is noted in the comment-and-response section of a <a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(23)00114-9/fulltext">paper</a> about the algorithm. In terms of obscurity, that&#8217;s the academic equivalent of Wikipedia&#8217;s Talk pages &#8212; most of the public wouldn&#8217;t even know such a thing exists.</p><h3><strong>An algorithmic absurdity: cancer improves survival</strong></h3><p>While the authors of the above paper mention in passing that one of the two models in the algorithm (post-transplant survival) doesn&#8217;t seem to do much, their <a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(23)00114-9/fulltext">main point</a> is about the other model &#8212; the one that assesses need by predicting survival on the waiting list. They show that it expects patients with cancer to survive <em>longer</em> than those without cancer (all else being equal). This kind of thing is sometimes called an algorithmic absurdity: a prediction that would seem obviously wrong to a person applying common sense.</p><p>The prediction about patients with cancer is not just an oddity &#8212; it has big consequences for patients&#8217; lives: &#8220;for the first 3 years of the TBS scheme (excluding the period when TBS offering was suspended due to COVID-19), patients with cancer were rarely allocated livers by the TBS model&#8221;. This is what led to the 2022 revision of the algorithm.</p><p>The finding is reminiscent of a well-known failure from a few decades ago wherein a model <a href="https://pubmed.ncbi.nlm.nih.gov/9040894/">predicted</a> that patients with asthma were at <em>lower</em> risk of developing complications from pneumonia. Fortunately, this was spotted before the model was deployed. 
It turned out to be a correct pattern in the data, but only because asthmatic patients were sent to the ICU, where they received better care. Of course, it would have been disastrous to replace that very policy with the ML model that treated asthmatic patients as lower risk. That case study has become a textbook illustration of the usefulness of <a href="https://dl.acm.org/doi/10.1145/2783258.2788613">interpretable</a> models over black-box models. If researchers can easily examine the coefficients of the model, implausible behaviors become more readily apparent.</p><p>The TBS does use interpretable regression models. But it is actually two different sets of models, one for patients with cancer and one for patients without cancer, because the two groups are represented by two different data sources. That may explain how the implausible behavior of the algorithm arose &#8212; the patient populations are different; perhaps the population from which the cancer patients were drawn was younger or healthier in other ways. Of course, this doesn&#8217;t justify the algorithm&#8217;s behavior, in which flipping a <em>specific</em> patient from non-cancer to cancer increases the predicted survival. The fact that there are two different sets of models may also explain why it went undetected for so long &#8212; the problem is not obvious from the regression coefficients and can only be detected by simulating a patient population.</p><h3><strong>Sleepwalking into utilitarian ethics</strong></h3><p>Predictive logic bakes in a utilitarian worldview &#8212; the most good for the greatest number. That makes it hard to incorporate a notion of deservingness. Many people have a strong moral intuition that patients whose conditions result from factors outside their control are more deserving of help. From the Financial Times article:</p><blockquote><p>Trivedi [the hepatologist] said patients found [the bias against younger patients] particularly unfair, because younger people tended to be born with liver disease or develop it as children, while older patients more often contracted chronic liver disease because of lifestyle choices such as drinking alcohol.</p></blockquote><p>Donor preferences are also neglected. For example, presumably some donors would prefer to help someone in their own community. But in the utilitarian worldview, this is simply <a href="https://www.jstor.org/stable/2265052?seq=4">geographic discrimination</a>. (Our point is not about whether deservingness or donor preferences are important considerations, but rather that the algorithm dictates the ethical framework.)</p><p>Traditionally, individual physicians made decisions about transplants without much formal reasoning or accountability. But with the routinization and increasing scale of organ transplantation, and the shift to nationwide matching systems, manual matching is no longer feasible. Automation has forced decision makers to make the matching criteria explicit. This formalization can be a <a href="https://dl.acm.org/doi/abs/10.1145/3351095.3372871">good thing</a>, as it allows ethical debate about the pros and cons of precisely specified policies.</p><p>But automation has also privileged utilitarianism, as it is much more amenable to calculation. Non-utilitarian considerations resist quantification. No committee of decision makers would want to be in charge of determining how much of a penalty to apply to patients who drank alcohol, and whatever choice they made would meet fierce objection. 
In contrast, the veneer of data-driven decision making, even though it hides many normative choices, allows decision makers to reach consensus and to deploy algorithms without endless debate.</p><p>For this reason, utilitarianism has been ascendant in many, many domains over the last few decades, including medical ethics and public health.</p><p>While the liver matching algorithm optimizes life years (albeit poorly), other algorithms and institutions go one step further and optimize &#8220;quality-adjusted&#8221; life years, taking into account factors such as how well a person is able to complete daily tasks and how much pain they are in. Quality adjustment has side effects such as giving lower preference to disabled people.</p><p>Overall, we are not necessarily against this shift to utilitarian logic, but we think it should only be adopted if it is the result of a democratic process, not just because it&#8217;s more convenient. The tail shouldn&#8217;t wag the dog. It isn&#8217;t clear to what extent the wider public is even aware of the widespread shift to nationalized transplant systems &#8212; in many countries, for many organs &#8212; and the ethical logics that underpin them. Public input about <em>specific</em> systems, such as the one we&#8217;ve discussed, is not a replacement for broad societal consensus on the underlying moral frameworks. Nor should this debate be confined to the <a href="https://www.cambridge.org/core/journals/cambridge-quarterly-of-healthcare-ethics/article/does-the-united-states-do-it-better-a-comparative-analysis-of-liver-allocation-protocols-in-the-united-kingdom-and-the-united-states/681D3480F8940AF51D65B3E4114DEBA5">medical ethics</a> literature.</p><h3><strong>What&#8217;s next</strong></h3><p>The liver allocation algorithm was developed and is run by the National Health Service (NHS), the UK&#8217;s publicly funded health system. We&#8217;ve previously explained in this newsletter that <a href="https://www.aisnakeoil.com/p/the-bait-and-switch-behind-ai-risk">bad outcomes</a> result when public sector agencies outsource algorithmic decision making systems to opaque, profit-oriented companies. That&#8217;s not the case here. The developers are doing their best to save lives. A lot of thought and care went into the system, and there was public input. If there were missteps, they are a result of how hard the problem is.</p><p>There are clear problems with the liver allocation algorithm that can and should be addressed. There are at least three ways to mitigate the age bias. The first is to collect more and better data. The second is to put a thumb on the algorithm&#8217;s scale to ensure that the age distribution of recipients is roughly in line with society&#8217;s normative ideals. This can be achieved by formulating a constrained optimization problem (there are many papers on algorithmic fairness that show how to do this). The third is to stop using age as a factor. We don&#8217;t like this approach for reasons described above, but it is perhaps more easily defensible to non-experts.</p><p>The <a href="https://www.odt.nhs.uk/transplantation/liver/liver-advisory-group/">Liver Advisory Group</a> is the entity with the power to effect changes. The members meet every six months. 
Unfortunately, they haven&#8217;t yet uploaded the minutes from their May 2024 meeting, so it isn&#8217;t clear whether they are paying attention.</p><p>The deeper, systemic problem will be harder to address &#8212; inadequate transparency and public participation in medical ethics. The rapid adoption of AI for medical decision making requires a whole-of-society ethical debate. This isn&#8217;t about specific algorithms but about the bundle of unexamined assumptions behind their claim to efficacy and thus to legitimacy. Better late than never.</p><p>Zooming out beyond medicine, the pitfalls that arise in disparate applications of predictive decision making bear <a href="https://predictive-optimization.cs.princeton.edu/">striking similarities</a> to one another. This calls for <a href="https://www.birs.ca/events/2024/5-day-workshops/24w5283">more research</a> on avoiding these flaws as well as a community of practitioners from different fields who can learn from each other. Venues such as the <a href="https://facctconference.org/">conference on Fairness, Accountability, and Transparency</a> can bring such cross-cutting groups together.</p><h4><strong>Further reading</strong></h4><ul><li><p><em><a href="https://www.science.org/doi/10.1126/science.aax2342">Dissecting racial bias in an algorithm used to manage the health of populations</a></em> is a classic paper that revealed how the use of the wrong target variable can lead to severe biases.</p></li><li><p><em><a href="https://www.russellsage.org/publications/voices-code">Voices in the Code</a></em> is an excellent book that details the development of a kidney matching algorithm in the U.S. It shows the benefits of public participation &#8212; how it can uncover flaws in algorithms that developers had not anticipated, and increase the legitimacy of the system that is ultimately deployed. But participation is no panacea. The process discussed in the book took a decade, during which time a far worse system remained in place. And participatory development of a specific system does not obviate the need for a broader public debate on the utilitarian and algorithmic turn in medical ethics.</p></li></ul><h4><strong>Acknowledgment</strong></h4><p>We are grateful to Emma Pierson for feedback on a draft of this essay.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For more details on the previous system, see <a href="https://aasldpubs.onlinelibrary.wiley.com/doi/pdf/10.1002/lt.24462">this article</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>There are many other differences between human and algorithmic decision making. Algorithms tend to be much better at optimizing a given objective, such as maximizing the number of life years gained through transplantation. But human decision makers can more easily incorporate multiple objectives, especially non-utilitarian ones. 
Human decision makers of course have their own biases; whose biases are worse and which ones are easier to mitigate are complex questions whose answers are context-dependent.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Update: Regarding the quote in the Financial Times, an NHS Blood and Transplant spokesperson said: &#8220;People can receive transplants through various pathways. Some use the TBS and some do not use the TBS. It is incorrect to say that no-one under 40 could receive a transplant. It is also incorrect to say that no-one under 40 could receive a transplant as the patient most in need according to the TBS. It could also unduly worry patients on the transplant list. Over the past three UK financial years, 133 people aged 17 to 39 received transplants that were allocated to them through the TBS, as the patient in the country who would benefit most from that liver. Additionally, over the past three financial years, a further 285 people aged 17 to 39 received liver only transplants through other pathways.&#8221;</p></div></div>]]></content:encoded></item><item><title><![CDATA[FAQ about the book and our writing process]]></title><description><![CDATA[What's in the book and how we wrote it]]></description><link>https://www.normaltech.ai/p/faq-about-the-book-and-our-writing</link><guid isPermaLink="false">https://www.normaltech.ai/p/faq-about-the-book-and-our-writing</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Fri, 04 Oct 2024 15:55:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/907f61bc-0697-41a3-a729-b719812a6d1a_2166x1402.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The AI Snake Oil book was <a href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference/dp/069124913X">published</a> last week. We&#8217;re grateful for the level of interest &#8212; it&#8217;s sold about 8,000 copies so far. We&#8217;ve received many questions about the book, both its substance and the writing process. Here are the most common ones.</p><h3><strong>Why don&#8217;t you recognize the benefits of AI?</strong></h3><p>We do! The book is not an anti-technology screed. If our point was that all AI is useless, we wouldn&#8217;t need a whole book to say it. It&#8217;s precisely because of AI&#8217;s usefulness in many areas that hype and snake oil have been successful &#8212; it&#8217;s hard for people to tell these apart, and we hope our book can help.</p><p>We also recognize that the harms we describe are usually not solely due to tech, and much more often due to AI being an amplifier of existing problems in our society. A recurring pattern we point out in the book is that "broken AI is appealing to broken institutions" (Chapter 8).</p><h3><strong>What&#8217;s your optimistic vision for AI, then?</strong></h3><p>There&#8217;s a humorous definition of AI that says &#8220;AI is whatever hasn&#8217;t been done yet&#8221;. When an AI application starts working reliably, it disappears into the background of our digital or physical world. We take it for granted. And we stop calling it AI. When a technology is new, doesn&#8217;t work reliably, and has double-edged societal implications, we&#8217;re more likely to call it AI. 
So it&#8217;s easy to miss that AI already plays a huge positive role in our lives.</p><p>There&#8217;s a long list of applications that would have been called AI at one point but probably wouldn&#8217;t be today: Robot vacuum cleaners, web search, autopilot in planes, autocomplete, handwriting recognition, speech recognition, spam filtering, and even spell check. These are the kinds of AI we want more of &#8212; reliable tools that quietly make our lives better.&nbsp;</p><p>Many AI applications that make the news for the wrong reasons today &#8212; such as self-driving cars due to occasional crashes &#8212; are undergoing this transition (although, as we point out in the book, it has taken far longer than developers and CEOs anticipated). We think people will eventually take self-driving cars for granted as part of our physical environment.&nbsp;</p><p>Adapting to these changes won&#8217;t be straightforward. It will lead to job loss, require changes to transportation infrastructure and urban planning, and have various ripple effects. But it will have been a good thing, because the safety impact of reliable self-driving tech <a href="https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813560">can&#8217;t be overstated</a>.</p><h3><strong>What&#8217;s the central message of the book?</strong></h3><p>AI is an umbrella term for a set of loosely related technologies and applications. To answer questions about the benefits or risks of AI, its societal impact, or how we should approach the tech, we need to break it down. And that&#8217;s what we do in the book.&nbsp;</p><p>We&#8217;re broadly negative about predictive AI, a term we use to refer to AI that&#8217;s used to make decisions about people based on predictions about their future behavior or outcomes. It&#8217;s used in criminal risk prediction, hiring, healthcare, and many other consequential domains. Our chapters on predictive AI have many horror stories of people denied life opportunities because of algorithmic predictions.</p><p>It&#8217;s hard to predict the future, and AI doesn&#8217;t change that. This is not because of a limitation of the technology but because of inherent limits to predicting human behavior grounded in sociology. (The book owes a huge debt to Princeton sociologist <a href="https://www.princeton.edu/~mjs3/">Matt Salganik</a>; our collaboration with him informed and inspired the book.)&nbsp;</p><p>Generative AI, on the other hand, is a double-edged technology. We are broadly positive about it in the long run, and emphasize that it is useful to essentially every knowledge worker. But its rollout has been chaotic, and misuses have been prevalent. It&#8217;s as if everyone in the world has simultaneously been given the equivalent of a free buzzsaw. 
As we say in the book:</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/57d698dc-e69c-470c-bae0-8faab745d751_784x240.png" alt=""></figure></div><h3><strong>What else is in the book?</strong></h3><p>See the overview of the chapters <a href="https://www.aisnakeoil.com/p/starting-reading-the-ai-snake-oil">here</a>.</p><h3><strong>Isn&#8217;t your book going to be outdated soon?</strong></h3><p>We know that book publishing moves at a slower timescale than AI. 
So the book is about the foundational knowledge needed to separate real advances from hype, rather than commentary on breaking developments. In writing every chapter, and every paragraph, we asked ourselves: will this be relevant in five years? This also means that there&#8217;s very little overlap between the newsletter and the book.&nbsp;</p><h3><strong>There seem to be three warring camps: AI safety, e/acc, and AI ethics. Which one are you in?</strong></h3><p>The AI discourse is polarized because of differing opinions about which AI risks matter, how serious and urgent they are, and what to do about them. In broad strokes:</p><ul><li><p>The AI safety community considers catastrophic AI risks a major societal concern, and supports government intervention. It has strong ties to the effective altruism movement.&nbsp;</p></li><li><p>e/acc is short for effective accelerationism, a play on effective altruism. It is a libertarian movement that sees tech as the solution and rejects government intervention.</p></li><li><p>The AI ethics community focuses on materialized harms from AI such as discrimination and labor exploitation, and sees the focus on AI safety as a distraction from those priorities.</p></li></ul><p>In the past, the two of us worked on AI ethics and saw ourselves as part of that community. But we no longer identify with any of these labels. We view the polarization as counterproductive. We used to subscribe to the &#8220;distraction&#8221; view but no longer do. The fact that safety concerns have made AI policy a priority has increased, not decreased, policymakers&#8217; attention to issues of <a href="https://www.markey.senate.gov/news/press-releases/senator-markey-introduces-ai-civil-rights-act-to-eliminate-ai-bias-enact-guardrails-on-use-of-algorithms-in-decisions-impacting-peoples-rights-civil-liberties-livelihoods">AI and civil rights</a>. These two communities both want AI regulation, and should focus on their <a href="https://www.lawfaremedia.org/article/ai-regulation-s-champions-can-seize-common-ground-or-be-swept-aside">common ground</a> rather than their differences.</p><p>These days, much of our technical and policy work is on AI safety, but we have explained how we have a <a href="https://www.aisnakeoil.com/t/ai-safety">different perspective</a> from the mainstream of the AI safety community. We see our role as engaging seriously with safety concerns and presenting an <a href="https://understanding-ai-safety.org/">evidence-based</a> vision of the future of advanced AI that rejects both apocalyptic and utopian narratives.</p><h3><strong>How long did it take to write the book?</strong></h3><p>It depends on what one means by writing the book. The book is not just an explainer, and developing a book&#8217;s worth of genuinely new, scholarly ideas takes a long time. 
Here&#8217;s a brief timeline:</p><ul><li><p>2019: Arvind developed an <a href="https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf">early version</a> of the high-level thesis of the book</p></li><li><p>2020: We started doing research and publishing papers that informed the book</p></li><li><p>mid-2022: Started writing the book and launched this newsletter</p></li><li><p>Sep 2023: Submitted the initial author manuscript</p></li><li><p>Jan 2024: Submitted the final author manuscript after addressing peer reviewers&#8217; feedback</p></li><li><p>May 2024: Final proofs done</p></li><li><p>Sep 2024: Publication</p></li></ul><h3><strong>What was the writing process like?</strong></h3><p>Doing the bulk of the writing in a year required a lot of things to go right. Here&#8217;s the process we used.</p><ul><li><p>We figured out the structure up front. Changes that affect multiple chapters are much harder to pull off than changes within a chapter. Since we&#8217;d been thinking about the topics of the book for years before we started writing, we already knew at a high level what we wanted to say.</p></li><li><p>Throughout, we had periodic check-ins with our editor, <a href="https://press.princeton.edu/our-people/hallie-stebbins">Hallie Stebbins</a>. Early on, Hallie helped us sanity check our decisions about structure, and sharing our progress with her gave us something to look forward to. In the later stages, her input was critical.</p></li><li><p>We divided up the chapters between us. Of course, we were both involved in every chapter, but it&#8217;s way less messy if one person takes the lead on each one. For this to work well, we had to both use the same &#8220;voice&#8221;. Can you tell who took the lead on which chapter?&nbsp;</p></li><li><p>We sent Hallie our drafts of each chapter as we completed them (after a couple of rounds of internal editing), instead of waiting till the end. We&#8217;re glad we did! Although we&#8217;re decent writers, Hallie had, on average, a couple of edits or suggestions per paragraph, mostly to fix awkward wording or point out something that was confusing.&nbsp;</p></li><li><p>While the line edits made the book dramatically more readable, even more important was her high-level feedback. Notably, she repeatedly asked us &#8220;how does this relate to the AI Snake Oil theme?&#8221; which helped keep us focused.</p></li><li><p>Oh, and Hallie couldn&#8217;t tell who took the lead on which chapter, which was a big relief!</p></li><li><p>We wrote the introductory chapter last. We know far more people will read the intro than the rest of the book, in part because it&#8217;s <a href="https://press.princeton.edu/books/hardcover/9780691249131/ai-snake-oil#preview">available online</a>, so we really wanted to get it right. This was easier to do at the end, once we knew exactly what the message of each chapter was.</p></li><li><p>The next step was peer review. We received reviews from Melanie Mitchell, Molly Crockett, Chris Bail, and three anonymous reviewers. Between them, they had over 30 pages of feedback, for which we are extremely grateful. It took a couple of months to address all of it, but we&#8217;re glad we did.</p></li><li><p>Overall, each chapter underwent 6-8 rounds of editing, including copyediting. That&#8217;s pretty normal!&nbsp;</p></li><li><p>There&#8217;s a lot of work that goes into publicizing the book. 
Between the two of us we&#8217;ve done about 50 talks, interviews, and podcasts in the last couple of months, and there&#8217;s a whole lot more that our publicist, <a href="https://press.princeton.edu/news/pw-star-watch-2023">Maria Whelan</a>, and others did for us behind the scenes!</p></li></ul><p>We hope you like the end result. Let us know what you think, in the comments or on <a href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference/dp/069124913X">Amazon</a>.</p>]]></content:encoded></item><item><title><![CDATA[Can AI automate computational reproducibility?]]></title><description><![CDATA[A new benchmark to measure the impact of AI on improving science]]></description><link>https://www.normaltech.ai/p/can-ai-automate-computational-reproducibility</link><guid isPermaLink="false">https://www.normaltech.ai/p/can-ai-automate-computational-reproducibility</guid><dc:creator><![CDATA[Sayash Kapoor]]></dc:creator><pubDate>Wed, 18 Sep 2024 14:32:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8f816374-794e-4011-b986-6c5221d489de_2908x1682.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last month, <a href="https://sakana.ai/ai-scientist/">Sakana AI released an "AI scientist"</a>, which the company called "the first comprehensive system for fully automatic scientific discovery". It was <a href="https://venturebeat.com/ai/sakana-ai-scientist-conducts-research-autonomously-challenging-scientific-norms/">touted</a> as being able to accelerate science without suffering from human limitations.&nbsp;</p><p>Unfortunately, the "AI Scientist" has many <a href="https://x.com/jimmykoppel/status/1828077203956850756">shortcomings</a>. It has no checks for novelty, so generated papers could rehash earlier work. And Sakana did not perform any human review (let alone expert &#8220;peer&#8221; review) of the generated papers&#8212;so it is unclear if the papers are any good (apparently they are <a href="https://x.com/jimmykoppel/status/1828077203956850756">not</a>). While these flaws are particularly flagrant in Sakana's case, the <a href="https://arxiv.org/pdf/2407.01502">lack of good evaluation</a> affects most AI agents, making it hard to measure their real-world impact.</p><p>Today, we introduce a new benchmark for measuring how well AI can reproduce existing computational research. We also share how this project has changed our thinking&nbsp;about &#8220;general intelligence&#8221; and the potential economic impact of AI. <strong><a href="https://arxiv.org/pdf/2409.11363v1">Read the paper</a>.</strong></p><h3><strong>CORE-Bench: A new benchmark for evaluating AI for reproducing research</strong></h3><p>Visions of AI automating science are enticing, but aren&#8217;t within reach, and lead to <a href="https://www.aisnakeoil.com/p/scientists-should-use-ai-as-a-tool">flawed science</a>. In contrast, using AI for well-scoped tasks such as verifying computational reproducibility can save a lot of time and redirect effort towards more productive scientific activity. 
AI could also help find relevant literature, write code to rapidly test ideas, and perform other computational tasks.</p><p>In a <a href="https://arxiv.org/pdf/2409.11363v1">new paper</a>, we introduce <a href="https://arxiv.org/pdf/2409.11363v1">CORE-Bench</a> (<strong>Co</strong>mputational <strong>Re</strong>producibility Agent <strong>Bench</strong>mark), a benchmark for measuring how well AI can automate computational reproducibility, that is, reproducing a paper&#8217;s findings when the code and data are available. The authors are <a href="https://www.zacharysiegel.org/">Zachary S. Siegel</a>, <a href="https://www.cs.princeton.edu/~sayashk/">Sayash Kapoor</a>, <a href="https://citp.princeton.edu/citp-people/nitya-nadgir/">Nitya Nadgir</a>, <a href="https://benediktstroebl.github.io/">Benedikt Stroebl</a>, and <a href="https://www.cs.princeton.edu/~arvindn/">Arvind Narayanan</a>. CORE-Bench is a first step in a larger project to rigorously evaluate progress in automating research tasks of increasing difficulty.</p><p>Computationally reproducing a study is a far more limited task than replication, which requires re-running experiments that might involve human subjects. Even the limited reproducibility task is hard: In the <a href="https://zenodo.org/records/8200058">2022 Machine Learning Reproducibility Challenge</a>, over a third of the papers could not be reproduced even when experts reproducing the papers had the code and data.&nbsp;</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/da369454-28e9-47a0-9e5b-f8218e63419c_1178x584.png" alt=""></figure></div><p>If AI could automate this mundane yet important task, researchers could automate the implementation of baselines, reviewers could more easily assess if a paper has flaws, and journals and conferences could more easily verify if submitted and published papers are reproducible.</p><p>We created CORE-Bench using scientific papers and their accompanying code and data repositories. We used <a href="http://codeocean.com">Code Ocean</a> to source papers that were likely to be reproducible. 
We manually reproduced 90 papers from computer science, medicine, and social science, and curated a set of questions for each paper to be able to verify the answers.&nbsp;</p><p>We release CORE-Bench with three difficulty levels. Tasks in all three levels require the use of both language and vision capabilities. The hardest version closely resembles real-world reproduction attempts, and we expect that improvements on the benchmark will translate to agents that are actually useful to scientists.</p><p>To implement baselines, we tested the generalist AutoGPT agent and also implemented a task-specific modification to AutoGPT, which we call CORE-Agent. While the task-specific version improved accuracy significantly, there is still massive room for improvement: the best agent (CORE-Agent with GPT-4o) has an accuracy of 22% on CORE-Bench-Hard.</p><h3><strong>Rethinking generality</strong></h3><p>Computational reproducibility requires setting up the code environment correctly, running the code, and seeing if it produces the same results as reported in the paper. Using the shell and other tools correctly is still tricky for LLMs. When we evaluated generalist agents like AutoGPT, we weren't surprised by their poor accuracy (less than 10% on CORE-Bench-Hard).&nbsp;</p><p>Yet, with a few person-days of effort, we were able to build CORE-Agent by modifying AutoGPT, which more than doubled accuracy on the hardest level. We also built a task-specific agent from scratch, but modifying AutoGPT was far less time-consuming while also resulting in a stronger agent. We are cautiously optimistic that this approach can be pushed to yield agents that perform well enough to be useful in practice.&nbsp;</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/43867f9d-88aa-4a21-8270-c6b8808c6d53_1274x626.png" alt=""><figcaption class="image-caption"><em>Simple task-specific modifications allow CORE-Agent to outperform AutoGPT.&nbsp;</em></figcaption></figure></div>
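<p><em>To make the reproduction task concrete, here is a minimal sketch of the check at its core: rerun a paper&#8217;s pipeline, then compare what it produces against what the paper reports. This is our own toy illustration, not code from CORE-Bench; the entry point (run.sh), the output file (results.json), and the tolerance are all hypothetical conventions.</em></p><pre><code>import json
import subprocess

def reproduce_and_check(repo_dir, reported, tol=0.01):
    """Rerun a repo's pipeline and compare its metrics to reported values."""
    # Hypothetical convention: the repo's entry point writes results.json.
    subprocess.run(["bash", "run.sh"], cwd=repo_dir, check=True)
    with open(f"{repo_dir}/results.json") as f:
        produced = json.load(f)
    # Count a value as reproduced if the rerun lands within tolerance.
    return all(
        abs(produced[metric] - value) &lt;= tol * max(abs(value), 1e-9)
        for metric, value in reported.items()
    )

# e.g., reproduce_and_check("paper_repo/", {"test_accuracy": 0.874})
</code></pre><p><em>The hard part for agents is everything this sketch assumes away: installing dependencies, fixing broken environments, and figuring out where the results even live.</em></p>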
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43867f9d-88aa-4a21-8270-c6b8808c6d53_1274x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:1274,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ve-E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43867f9d-88aa-4a21-8270-c6b8808c6d53_1274x626.png 424w, https://substackcdn.com/image/fetch/$s_!ve-E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43867f9d-88aa-4a21-8270-c6b8808c6d53_1274x626.png 848w, https://substackcdn.com/image/fetch/$s_!ve-E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43867f9d-88aa-4a21-8270-c6b8808c6d53_1274x626.png 1272w, https://substackcdn.com/image/fetch/$s_!ve-E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43867f9d-88aa-4a21-8270-c6b8808c6d53_1274x626.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Simple task-specific modifications allow CORE-Agent to outperform AutoGPT.&nbsp;</em></figcaption></figure></div><p>If this pattern of being able to easily adapt a generalist agent to produce a task-specific agent holds in other areas, it should make us rethink generality. Generality roughly translates to being able to use the same model or agent without modification to perform a variety of tasks. 
This notion of generality underpins how Artificial General Intelligence (or AGI) is usually understood and the hopes and fears that accompany it.&nbsp;</p><p>But at least from the point of view of economic impacts, generality might be a red herring. For a task such as computational reproducibility, on which expert humans collectively spend millions of hours every year, being able to automate it would be hugely impactful &#8212; regardless of whether the AI system did so out of the box, or after a few person-days (or even a person-year) of programmer effort.&nbsp;</p><p>In the AI Snake Oil <a href="https://www.aisnakeoil.com/p/starting-reading-the-ai-snake-oil">book</a>, we define generality as the inverse of task-specificity, and analyze how the history of AI (and computing) can be seen as the pursuit of gradually increasing generality. Increasing generality means decreasing the human effort it takes to build an AI system to perform a given task. From this perspective, systems like AutoGPT may be more general than most people (including us) gave them credit for.</p><p>Yet, definitions of AGI typically insist that a single system be able to do everything out of the box. There is no systematic effort to track how the human effort needed to build task-specific AI is changing over time. Just as we&#8217;ve <a href="https://www.aisnakeoil.com/p/ai-scaling-myths">argued</a> against flawed conceptions of generality that overestimate AI progress, we should avoid flawed conceptions of generality that underestimate it.&nbsp;</p><p><strong>Read the CORE-Bench paper <a href="https://arxiv.org/html/2409.11363v1">here</a>.</strong></p><h4><strong>Further reading</strong></h4><ul><li><p>In our recent paper, <a href="https://arxiv.org/pdf/2407.01502">AI Agents That Matter</a>, we found several shortcomings in AI agent evaluations. These findings informed the design of CORE-Bench.</p></li><li><p>We recently organized an <a href="https://sites.google.com/princeton.edu/agents-workshop">online workshop on useful and reliable AI agents</a> where leading experts shared their views on better agent design and evaluation. The <a href="https://www.youtube.com/live/-aKRsvgDEz0">workshop videos are available online</a>.</p></li><li><p><a href="https://arxiv.org/pdf/2409.07440">Ben Bogin et al.</a> released the SUPER benchmark to evaluate whether AI agents can set up and execute tasks from repositories accompanying research papers. It is another interesting benchmark for measuring AI agents' capability to automate research tasks. It differs from CORE-Bench in many ways:&nbsp;</p><ul><li><p>CORE-Bench consists of tasks across scientific disciplines (computer science, medicine, social science), whereas SUPER consists of tasks from AI.</p></li><li><p>CORE-Bench requires both vision-language and language models and spans multiple programming languages (Python and R), whereas SUPER involves only language models and Python.</p></li><li><p>Tasks in SUPER require access to a Jupyter notebook. 
In contrast, tasks in CORE-Bench require shell access and allow the agent to modify the sandbox arbitrarily.</p></li></ul></li></ul>]]></content:encoded></item><item><title><![CDATA[Start reading the AI Snake Oil book online]]></title><description><![CDATA[The book was published September 2024]]></description><link>https://www.normaltech.ai/p/starting-reading-the-ai-snake-oil</link><guid isPermaLink="false">https://www.normaltech.ai/p/starting-reading-the-ai-snake-oil</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Tue, 10 Sep 2024 20:55:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3b84a7d7-b736-45ad-a1c6-8162dac6d732_4480x2520.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The first chapter of the <a href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference/dp/069124913X">AI snake oil book</a> is <a href="https://press.princeton.edu/books/hardcover/9780691249131/ai-snake-oil#preview">available online</a>. It is 30 pages long and summarizes the book&#8217;s main arguments. If you haven't ordered the book yet, we hope that reading the <a href="https://press.princeton.edu/books/hardcover/9780691249131/ai-snake-oil#preview">introductory chapter</a> will convince you to <a href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference/dp/069124913X">get yourself a copy</a>.</p><p><strong>Update (September 2025):</strong> It has been a year since the release of AI Snake Oil. In the time since its release, the two of us have <a href="https://www.youtube.com/watch?v=C3TqcUEFR58">given</a> <a href="https://www.youtube.com/watch?v=CedHzg-VRiI">talks</a>, appeared on <a href="https://www.youtube.com/watch?v=M3U5UVyGTuQ">podcasts</a>, published <a href="https://www.normaltech.ai/p/ai-snake-oil-exercises-and-discussion">exercises to accompany the book</a>, and written a new preface and epilogue for the <a href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference/dp/0691249148/">paperback edition of the book</a>. The book was included in Nature&#8217;s list of the 10 best books of 2024, Bloomberg&#8217;s 49 best books of 2024, and Forbes&#8217;s 10 must-read tech books of 2024. It has received many positive reviews, including in the <a href="https://www.newyorker.com/culture/open-questions/in-the-age-of-ai-what-makes-people-unique">New Yorker</a>. We are grateful to readers of the book for engaging deeply with its ideas.</p><p>We have now started working on our next project together, <a href="https://www.normaltech.ai/p/ai-as-normal-technology">AI as Normal Technology</a>. The project picks up where AI Snake Oil left off: whereas AI Snake Oil was an attempt to understand the present and near-term impacts of AI, AI as Normal Technology is a framework to think about its future impacts<em>. </em>The new name of this newsletter reflects this change. 
We hope you will follow along.</p><div class="captioned-image-container"><figure><a target="_blank" href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference/dp/069124913X"><img src="https://substack-post-media.s3.amazonaws.com/public/images/b3f7d684-cc18-4d05-96a2-f19a245889ad_962x1448.png" alt=""></a></figure></div><h3><strong>The single most confusing thing about AI</strong></h3><p>Our book is about demystifying AI, so right out of the gate we address what we think is the single most confusing thing about it:&nbsp;</p><div class="captioned-image-container"><figure><a target="_blank" href="https://press.princeton.edu/books/hardcover/9780691249131/ai-snake-oil#preview"><img src="https://substack-post-media.s3.amazonaws.com/public/images/c6d1430a-2af9-4dda-9a2a-6df1a2694d4f_1290x1732.png" alt="AI is an umbrella term for a set of loosely related technologies"></a><figcaption class="image-caption"><em>AI is an umbrella term for a set of loosely related technologies</em></figcaption></figure></div><p>Because AI is an umbrella term, we treat each type of AI differently. We have chapters on predictive AI and generative AI, as well as on AI used for social media content moderation. We also have a chapter on whether AI is an existential risk. We conclude with a discussion of why AI snake oil persists and what the future might hold. By AI snake oil we mean AI applications that do not (and perhaps cannot) work. Our book is a guide to identifying AI snake oil and AI hype. We also look at AI that is harmful even if it works well &#8212; such as face recognition used for mass surveillance.&nbsp;</p><p>While the book is meant for a broad audience, it does not simply rehash the arguments we have made in our papers or on this newsletter. We make scholarly contributions, and we wrote the book to be suitable for adoption in courses. We will soon release exercises and class discussion questions to accompany the book.</p><h3><strong>What's in the book</strong></h3><p><strong>Chapter 1: Introduction. </strong>We begin with a summary of our main arguments in the book. 
We discuss the definition of AI (and more importantly, why it is hard to come up with one), how AI is an umbrella term, what we mean by AI Snake Oil, and who the book is for.&nbsp;</p><p>Generative AI has made huge strides in the last decade. On the other hand, predictive AI is used for predicting outcomes to make consequential decisions in hiring, banking, insurance, education, and more. While predictive AI can find broad statistical patterns in data, it is marketed as far more than that, leading to major real-world misfires. Finally, we discuss the benefits and limitations of AI for content moderation on social media.</p><p>We also tell the story of what led the two of us to write the book. <strong>The entire first chapter is now <a href="https://press.princeton.edu/books/hardcover/9780691249131/ai-snake-oil#preview">available online</a>.</strong></p><p><strong>Chapter 2: How predictive AI goes wrong. </strong>Predictive AI is used to make predictions about people&#8212;will a defendant fail to show up for trial? Is a patient at high risk of negative health outcomes? Will a student drop out of college? These predictions are then used to make consequential decisions. Developers claim predictive AI is groundbreaking, but in reality it suffers from a number of shortcomings that are hard to fix.&nbsp;</p><p>We have discussed the <a href="https://www.aisnakeoil.com/p/ai-cannot-predict-the-future-but">failures of predictive AI</a> in this blog. But in the book, we go much deeper through case studies to show how predictive AI fails to live up to the promises made by its developers.</p><p><strong>Chapter 3: Can AI predict the future? </strong>Are the shortcomings of predictive AI inherent, or can they be resolved? In this chapter, we look at why predicting the future is hard &#8212; with or without AI. While we have made consistent progress in some domains such as weather prediction, we argue that this progress cannot translate to other settings, such as individuals' life outcomes, the success of cultural products like books and movies, or pandemics.&nbsp;</p><p>Since much of our newsletter is focused on topics of current interest, this is a topic that we have never written about here. Yet, it is foundational knowledge that can help you build intuition around when we should expect predictions to be accurate.</p><p><strong>Chapter 4: The long road to generative AI. </strong>Recent advances in generative AI can seem sudden, but they build on a series of improvements over seven decades. In this chapter, we retrace the history of computing advances that led to generative AI. While we have written a lot about current trends in generative AI, in the book, we look at its past. This is crucial for understanding what to expect in the future.&nbsp;</p><p><strong>Chapter 5: Is advanced AI an existential threat? </strong>Claims about AI wiping out humanity are common. Here, we critically evaluate claims about AI's existential risk and find several shortcomings and fallacies in popular discussion of x-risk. We discuss approaches to defending against AI risks that improve societal resilience regardless of the threat of advanced AI.</p><p><strong>Chapter 6: Why can't AI fix social media? </strong>One area where AI is heavily used is content moderation on social media platforms. We discuss the current state of AI use on social media, and highlight seven reasons why improvements in AI alone are unlikely to solve platforms' content moderation woes. 
We haven't written about content moderation in this newsletter.</p><p><strong>Chapter 7: Why do myths about AI persist? </strong>Companies, researchers, and journalists all contribute to AI hype. We discuss how myths about AI are created and how they persist. In the process, we hope to give you the tools to read AI news with the appropriate skepticism and identify attempts to sell you snake oil.</p><p><strong>Chapter 8: Where do we go from here? </strong>While the previous chapter focuses on the <em>supply</em> of snake oil, in the last chapter, we look at where the <em>demand</em> for AI snake oil comes from. We also look at the impact of AI on the future of work, the role and limitations of regulation, and conclude with vignettes of the many possible futures ahead of us. We have the agency to determine which path we end up on, and each of us can play a role.</p><p>We hope you will find the book useful and look forward to hearing what you think.&nbsp;</p><h4><strong>Early reviews</strong></h4><ul><li><p><a href="https://www.newyorker.com/culture/open-questions/in-the-age-of-ai-what-makes-people-unique">The New Yorker</a>: "In AI Snake Oil, Arvind Narayanan and Sayash Kapoor urge skepticism and argue that the blanket term AI can serve as a smokescreen for underperforming technologies."</p></li><li><p><a href="https://www.kirkusreviews.com/book-reviews/arvind-narayanan/ai-snake-oil/">Kirkus</a>: "Highly useful advice for those who work with or are affected by AI&#8212;i.e., nearly everyone."</p></li><li><p><a href="https://www.publishersweekly.com/pw/by-topic/new-titles/adult-announcements/article/95302-fall-2024-adult-preview-science.html">Publishers' Weekly</a>: Featured in the Fall 2024 list of top science books.</p></li><li><p><a href="https://www.practicalecommerce.com/ai-snake-oil-sorts-promise-from-hype">Jean Gazis</a>: "The authors admirably differentiate fact from opinion, draw from personal experience, give sensible reasons for their views (including copious references), and don&#8217;t hesitate to call for action. . . . If you&#8217;re curious about AI or deciding how to implement it, AI Snake Oil offers clear writing and level-headed thinking."</p></li><li><p><a href="https://www.sciencenews.org/article/ai-snake-oil-how-to-spot-hype">Elizabeth Quill</a>: "A worthwhile read whether you make policy decisions, use AI in the workplace or just spend time searching online. 
It&#8217;s a powerful reminder of how AI has already infiltrated our lives &#8212; and a convincing plea to take care in how we interact with it."</p></li><li><p><a href="https://www.telegraph.co.uk/books/non-fiction/review-how-ai-change-life-patrick-dixon-snake-oil-kapoor/">The Telegraph</a></p></li></ul><h4><strong>Book launch events</strong></h4><ul><li><p>September 24: <a href="https://www.eventbrite.com/e/arvind-narayanan-sayash-kapoor-ai-snake-oil-tickets-974088092707">City Lights</a> (virtual, free)</p></li><li><p>September 30: <a href="https://alumni.princeton.edu/events/tigerside-chat-ai-snake-oil">Princeton alumni events</a> (virtual, free)</p></li><li><p>October 24: <a href="https://princetonlibrary.libnet.info/event/11393063">Princeton Public Library</a> (Princeton, free)</p></li></ul><h4><strong>Podcasts and interviews</strong></h4><ul><li><p><a href="https://www.youtube.com/watch?v=BGvQmHd4QPE">Machine Learning Street Talk</a></p></li><li><p><a href="https://www.thetwentyminutevc.com/arvind-narayanan/">20VC</a></p></li><li><p><a href="https://www.youtube.com/watch?v=stJaSvQoIVQ">Scaling Theory</a></p></li><li><p><a href="https://cap.csail.mit.edu/podcasts/hype-vs-reality-current-state-ai-arvind-narayanan">MIT CSAIL</a></p></li><li><p><a href="https://www.abc.net.au/listen/programs/futuretense/artificial-intelligence-its-limits-risks-nd-thirst-for-resources/104238774">Future Tense</a></p></li></ul><p>We&#8217;ve been on many other podcasts that will air around the time of the book&#8217;s release, and we will keep this list updated.</p><h4><strong>Purchase links</strong></h4><ul><li><p>US: <a href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference/dp/069124913X">Amazon</a>, <a href="https://bookshop.org/p/books/ai-snake-oil-what-artificial-intelligence-can-do-what-it-can-t-and-how-to-tell-the-difference-arvind-narayanan/21324674">Bookshop</a>, <a href="https://www.barnesandnoble.com/w/ai-snake-oil-arvind-narayanan/1145168436">Barnes and Noble</a>, <a href="https://press.princeton.edu/books/hardcover/9780691249131/ai-snake-oil">Princeton University Press</a>. <a href="https://www.audible.com/pd/AI-Snake-Oil-Audiobook/B0D8BZSM1F">Audiobook</a>, <a href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference-ebook/dp/B0CW1JCKVL/">Kindle</a> editions.&nbsp;</p></li><li><p>UK: <a href="https://blackwells.co.uk/bookshop/product/AI-Snake-Oil-by-Arvind-Narayanan-Sayash-Kapoor/9780691249131">Blackwell&#8217;s</a>, <a href="https://www.waterstones.com/book/ai-snake-oil/arvind-narayanan/sayash-kapoor/9780691249131">Waterstones</a>.&nbsp;</p></li><li><p>Canada: <a href="https://www.indigo.ca/en-ca/ai-snake-oil-what-artificial-intelligence-can-do-what-it-cant-and-how-to-tell-the-difference/9780691249131.html">Indigo</a>.&nbsp;</p></li><li><p>Germany: <a href="https://www.amazon.de/-/en/Arvind-Narayanan/dp/069124913X/">Amazon</a>, <a href="https://www.kulturkaufhaus.de/en/detail/ISBN-9780691249131/Narayanan-Arvind/AI-Snake-Oil">Kulturkaufhaus</a>.&nbsp;</p></li><li><p>India: <a href="https://www.amazon.in/Snake-Oil-Artificial-Intelligence-Difference/dp/0691269947">Amazon</a></p></li></ul><p>The book is available to preorder internationally on <a href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference/dp/069124913X">Amazon</a>.</p>]]></content:encoded></item><item><title><![CDATA[AI companies are pivoting from creating gods to building products. 
Good.]]></title><description><![CDATA[Turning models into products runs into five challenges]]></description><link>https://www.normaltech.ai/p/ai-companies-are-pivoting-from-creating</link><guid isPermaLink="false">https://www.normaltech.ai/p/ai-companies-are-pivoting-from-creating</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Mon, 19 Aug 2024 20:57:05 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/29486da2-0bac-4f68-bafd-fc41d1356baa_2552x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI companies are collectively planning to spend a <a href="https://www.goldmansachs.com/images/migrated/insights/pages/gs-research/gen-ai--too-much-spend,-too-little-benefit-/TOM_AI%202.0_ForRedaction.pdf">trillion dollars</a> on hardware and data centers, but there&#8217;s been relatively little to show for it so far. This has led to a chorus of concerns that generative AI is a <a href="https://www.inc.com/sam-blum/new-warnings-ai-bubble-when-could-it-burst.html">bubble</a>. We won&#8217;t offer any predictions on what&#8217;s about to happen. But we think we have a solid diagnosis of how things got to this point in the first place.</p><p>In this post, we explain the mistakes that AI companies have made and how they have been trying to correct them. Then we will talk about five barriers they still have to overcome in order to make generative AI commercially successful enough to justify the investment.</p><h3><strong>Product-market fit</strong></h3><p>When ChatGPT launched, people found a thousand unexpected uses for it. This got AI developers overexcited. They completely misunderstood the market, underestimating the huge gap between proofs of concept and reliable products. This misunderstanding led to two opposing but equally flawed approaches to commercializing LLMs.&nbsp;</p><p>OpenAI and Anthropic focused on building models and not worrying about products. For example, it took 6 months for OpenAI to bother to release a ChatGPT iOS app and 8 months for an Android app!</p><p>Google and Microsoft shoved AI into everything in a panicked race, without thinking about which products would actually benefit from AI and how they should be integrated.</p><p>Both groups of companies forgot the &#8220;make something people want&#8221; mantra. The generality of LLMs allowed developers to fool themselves into thinking that they were exempt from the need to find a product-market fit, as if prompting a model to perform a task is a replacement for carefully designed products or features.</p><p>OpenAI and Anthropic&#8217;s DIY approach meant that early adopters of LLMs disproportionately tended to be bad actors, since they are more invested in figuring out how to adapt new technologies for their purposes, whereas everyday users want easy-to-use products. This has contributed to a poor public perception of the technology.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Meanwhile the AI-in-your-face approach by Microsoft and Google has led to features that are occasionally useful and more often annoying. It also led to many unforced errors due to inadequate testing like Microsoft's early <a href="https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html">Sydney</a> chatbot and Google's <a href="https://www.theverge.com/2024/2/21/24079371/google-ai-gemini-generative-inaccurate-historical">Gemini</a> image generator. 
This has also caused a backlash.</p><p>But companies are changing their ways. OpenAI seems to be transitioning from a research lab focused on a speculative future to something resembling a regular product company. If you take all the human-interest elements out of the OpenAI boardroom drama, it was fundamentally about the company's shift from creating gods to building products. Anthropic has been picking up many of the researchers and developers who cared more about artificial general intelligence and felt out of place at OpenAI, although Anthropic, too, has recognized the need to build products.</p><p>Google and Microsoft are slower to learn, but our guess is that <a href="https://www.vox.com/technology/354794/apple-artificial-intelligence-ai-wwdc">Apple</a> will force them to change. Last year Apple was seen as a laggard on AI, but it seems clear in retrospect that the slow and thoughtful approach that Apple showcased at WWDC, its developer conference, is more likely to resonate with users.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Google seems to have put more thought into integrating AI in its upcoming <a href="https://blog.google/products/pixel/google-pixel-9-pro-xl/">Pixel</a> phones and Android than it did into integrating it in search, but the phones aren&#8217;t out yet, so let&#8217;s see.</p><p>And then there&#8217;s Meta, whose vision is to use AI to create content and engagement on its ad-driven social media platforms. The societal implications of a world awash in AI-generated content are <a href="https://knightcolumbia.org/content/how-to-prepare-for-the-deluge-of-generative-ai-on-social-media">double-edged</a>, but from a business perspective it makes sense.</p><h3><strong>The big five challenges for consumer AI</strong></h3><p>There are five limitations of LLMs that developers need to tackle in order to make compelling AI-based consumer products.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> (We will discuss many of these in our upcoming <a href="https://sites.google.com/princeton.edu/agents-workshop">online workshop</a> on building useful and reliable AI agents on August 29.)</p><h4><strong>1. Cost</strong></h4><p>There are many applications where capability is not the barrier; cost is.
Even in a simple chat application, cost concerns dictate how much history a bot can keep track of &#8212; processing the entire history for every response quickly gets prohibitively expensive as the conversation grows longer.</p><p>There has been rapid progress on cost &#8212; in the last 18 months, cost-for-equivalent-capability has dropped by a factor of over 100.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> As a result, companies are claiming that LLMs are, or will soon be, &#8220;<a href="https://x.com/sama/status/1813984333352649087">too cheap to meter</a>&#8221;. Well, we&#8217;ll believe it when they make the API free.&nbsp;</p><p>More seriously, the reason we think cost will continue to be a concern is that in many applications, cost improvements directly translate to accuracy improvements. That&#8217;s because repeatedly retrying a task tens, thousands, or even millions of times turns out to be a good way to improve the chances of success, given the randomness of LLMs. So the cheaper the model, the more retries we can make with a given budget. We quantified this in our <a href="https://www.aisnakeoil.com/p/new-paper-ai-agents-that-matter">recent paper</a> on agents; since then, many other papers have made <a href="https://arxiv.org/pdf/2407.21787v1">similar</a> <a href="https://arxiv.org/html/2408.03314v1">points</a>.</p><p>That said, it is plausible that we&#8217;ll soon get to a point where in <em>most</em> applications, cost optimization isn&#8217;t a serious concern.</p><h4><strong>2. Reliability</strong></h4><p>We see capability and reliability as somewhat orthogonal. If an AI system performs a task correctly 90% of the time, we can say that it is <em>capable</em> of performing the task but it cannot do so <em>reliably</em>. The techniques that get us to 90% are unlikely to get us to 100%.&nbsp;</p><p>With statistical learning based systems, perfect accuracy is intrinsically hard to achieve. If you think about the success stories of machine learning, like ad targeting or fraud detection or, more recently, weather forecasting, perfect accuracy isn&#8217;t the goal &#8212; as long as the system is better than the state of the art, it is useful. Even in medical diagnosis and other healthcare applications, we <a href="https://www.himss.org/news/north-carolina-hospital-system-reduces-sepsis-cases-using-predictive-analytics">tolerate</a> a lot of error.&nbsp;</p><p>But when developers put AI in consumer products, people expect it to behave like software, which means that it needs to work deterministically. If your AI travel agent books vacations to the correct destination only 90% of the time, it won&#8217;t be successful. As we&#8217;ve <a href="https://www.aisnakeoil.com/p/new-paper-ai-agents-that-matter">written before</a>, reliability limitations partly explain the failures of recent AI-based gadgets.&nbsp;</p><p>AI developers have been slow to recognize this because as experts, we are used to conceptualizing AI as fundamentally different from traditional software. For example, the two of us are heavy users of chatbots and agents in our everyday work, and it has become almost automatic for us to work around the hallucinations and unreliability of these tools. 
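</p><p>The retry arithmetic mentioned under cost is worth making concrete, because it shows how cost and reliability are entangled. In a minimal sketch (our illustration, not from our paper), if each attempt at a task succeeds with probability <em>p</em> and attempts fail independently, at least one of <em>k</em> attempts succeeds with probability 1 &#8722; (1 &#8722; <em>p</em>)<sup><em>k</em></sup>:</p><pre><code># Back-of-the-envelope sketch (our illustration): retries convert
# cost savings into accuracy, assuming attempts fail independently
# and success can be verified automatically -- both optimistic.

def success_with_retries(p, k):
    """P(at least one success in k independent attempts)."""
    return 1 - (1 - p) ** k

for k in (1, 10, 100):
    print(k, round(success_with_retries(0.3, k), 4))
# 1 0.3
# 10 0.9718
# 100 1.0
</code></pre><p>The caveats in the comments are the reliability problem in miniature: real failures are correlated, and without a dependable way to verify success, extra retries don&#8217;t help.</p><p>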
A year ago, AI developers hoped or assumed that non-expert users would learn to adapt to AI, but it has gradually become clear that companies will have to adapt AI to user expectations instead, and make AI behave like traditional software.</p><p>Improving reliability is a research interest of our team at Princeton. For now, it&#8217;s fundamentally an open question whether it&#8217;s possible to build deterministic systems out of stochastic components (LLMs). Some companies have claimed to have solved reliability &#8212; for example, legal tech vendors have touted &#8220;hallucination-free&#8221; systems. But these claims were shown to be <a href="https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf">premature</a>.</p><h4><strong>3. Privacy</strong></h4><p>Historically, machine learning has often relied on sensitive data sources such as browsing histories for ad targeting or medical records for <a href="https://www.wired.com/story/google-deepmind-nhs-health-data/">health tech</a>. In this sense, LLMs are a bit of an anomaly, since they are primarily trained on public sources such as web pages and books.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>But with AI assistants, privacy concerns have come roaring back. To build useful assistants, companies have to train systems on user interactions. For example, to be good at composing emails, it would be very helpful if models were <a href="https://www.nytimes.com/2024/06/26/technology/terms-service-ai-training.html">trained on emails</a>. Companies&#8217; privacy policies are vague about this, and it is not clear to what extent this is happening.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> Emails, documents, screenshots, etc. are potentially much more sensitive than chat interactions.</p><p>There is a distinct type of privacy concern relating to inference rather than training. For assistants to do useful things for us, they must have access to our personal data. For example, Microsoft announced a controversial feature that would involve taking screenshots of users&#8217; PCs every few seconds, in order to give its Copilot AI a memory of your activities. But there was an outcry and the company <a href="https://www.bbc.com/news/articles/cd11rje1mrro">backtracked</a>.</p><p>We caution against purely technical interpretations of privacy such as &#8220;the data never leaves the device.&#8221; Meredith Whittaker <a href="https://mobile.x.com/mer__edith/status/1790692059059200017">argues</a> that on-device fraud detection normalizes always-on surveillance and that the infrastructure can be repurposed for more oppressive purposes. That said, <a href="https://www.wired.com/story/apple-intelligence-android-hybrid-ai-privacy/">technical innovations</a> can definitely help.</p><h4><strong>4. Safety and security</strong></h4><p>There is a cluster of related concerns when it comes to safety and security: unintentional failures such as the <a href="https://www.cnbc.com/2024/02/26/googles-gemini-ai-picture-generator-to-relaunch-in-a-few-weeks.html">biases</a> in Gemini&#8217;s image generation; misuses of AI such as voice cloning or deepfakes; and hacks such as prompt injection that can leak users&#8217; data or harm the user in other ways.</p><p>We think accidental failures are fixable.
As for most types of misuse, our view is that there is <a href="https://www.aisnakeoil.com/p/ai-safety-is-not-a-model-property">no way</a> to create a model that can&#8217;t be misused, and so the defenses must primarily be located downstream. Of course, not everyone agrees, so companies will keep getting bad press for inevitable misuses, but they seem to have absorbed this as a cost of doing business.</p><p>Let&#8217;s talk about the third category &#8212; hacking. From what we can tell, it is the one that companies seem to be paying the least attention to. At least theoretically, <a href="https://arxiv.org/abs/2302.12173">catastrophic hacks</a> are possible, such as AI worms that spread from user to user, tricking those users&#8217; AI assistants into doing harmful things including creating more copies of the worm.</p><p>Although there have been plenty of proof-of-concept demonstrations and <a href="https://embracethered.com/blog/posts/2023/google-bard-data-exfiltration/">bug</a> <a href="https://www.landh.tech/blog/20240304-google-hack-50000/">bounties</a> that uncovered these vulnerabilities in deployed products, we haven't seen this type of attack in the wild. We aren&#8217;t sure if this is because of the low adoption of AI assistants, or because the <a href="https://kai-greshake.de/posts/approaches-to-pi-defense/">clumsy defenses</a> that companies have pulled together have proven sufficient, or something else. Time will tell.</p><h4><strong>5. User interface</strong></h4><p>In many applications, the unreliability of LLMs means that there will have to be some way for the user to intervene if the bot goes off track. In a chatbot, it can be as simple as regenerating an answer or showing multiple versions and letting the user pick. But in applications where errors can be costly, such as flight booking, ensuring adequate supervision is trickier, and the system must avoid annoying the user with too many interruptions.</p><p>The problem is even harder with natural language interfaces where the user speaks to the assistant and the assistant speaks back. This is where a lot of the potential of generative AI lies. As just one example, AI that disappeared into your <a href="https://www.techradar.com/computing/artificial-intelligence/i-finally-tried-the-meta-ai-in-my-ray-ban-smart-glasses-thanks-to-an-accidental-uk-launch-and-its-by-far-the-best-ai-wearable">glasses</a> and spoke to you when you needed it, without even being asked &#8212; such as by detecting that you were staring at a sign in a foreign language &#8212; would be a whole different experience than what we have today. But the constrained user interface leaves very little room for incorrect or unexpected behavior.</p><h4><strong>Concluding thoughts</strong></h4><p>AI boosters often claim that due to the rapid pace of improvement in AI capabilities, we should see massive societal and economic effects soon. We are <a href="https://www.aisnakeoil.com/p/ai-scaling-myths">skeptical</a> of the trend extrapolation and sloppy thinking that go into those capability forecasts. More importantly, even if AI capability does improve rapidly, developers have to solve the challenges discussed above. These are sociotechnical and not purely technical, so progress will be slow. And even if those challenges are solved, organizations need to integrate AI into existing products and workflows and train people to use it productively while avoiding its pitfalls.
We should expect this to happen on a timescale of a decade or more rather than a year or two.&nbsp;</p><h4><strong>Further reading</strong></h4><p>Benedict Evans has <a href="https://www.ben-evans.com/benedictevans/2023/10/5/unbundling-ai">written</a> <a href="https://www.ben-evans.com/benedictevans/2024/6/8/building-ai-products">about</a> the importance of building single-purpose software using general-purpose language models.&nbsp;</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>To be clear, we don't think that reducing access to state-of-the-art models will <a href="https://www.aisnakeoil.com/p/licensing-is-neither-feasible-nor">reduce</a> misuse. But when it comes to LLMs, misuse is easier than legitimate uses (which require thought), so it isn't a surprise that misuses have been widespread.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The pace of AI adoption is relative. Even Apple's approach to integrating AI into its products has been <a href="https://www.newyorker.com/culture/infinite-scroll/apple-is-bringing-ai-to-your-personal-life-like-it-or-not">criticized</a> as too fast-paced.&nbsp;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>These are about factors that matter to the user experience; we are setting aside environmental costs, training on copyrighted data, etc.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>For example, GPT-3.5 (text-davinci-003) in the API cost $20 per million tokens, whereas gpt-4o-mini, which is more powerful, costs only 15 cents.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>To be clear, just because the data sources are public <a href="https://arxiv.org/pdf/2212.06470">doesn&#8217;t mean</a> there are no privacy concerns.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>For example, Google says &#8220;we use publicly available information to help train Google&#8217;s AI models&#8221;. Elsewhere it says that it may use private data such as emails to provide services, maintain and improve services, personalize services, and develop new services. One approach that is consistent with these disclosures is that only public data is used for pre-training of models like Gemini, but private data is used to fine-tune those models to create, say, an email auto-response bot. Anthropic is the one exception we know of. It says: &#8220;We do not train our generative models on user-submitted data unless a user gives us explicit permission to do so. 
To date we have not used any customer or user-submitted data to train our generative models.&#8221; This commitment to privacy is admirable, though we predict that it will put the company at a disadvantage if it more fully embraces building products.</p></div></div>]]></content:encoded></item><item><title><![CDATA[AI existential risk probabilities are too unreliable to inform policy]]></title><description><![CDATA[How speculation gets laundered through pseudo-quantification]]></description><link>https://www.normaltech.ai/p/ai-existential-risk-probabilities</link><guid isPermaLink="false">https://www.normaltech.ai/p/ai-existential-risk-probabilities</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Fri, 26 Jul 2024 11:29:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6418811e-f1c2-4127-b6dd-f243c4f458ef_1398x918.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>How seriously should governments take the threat of existential risk from AI, given the lack of consensus among researchers? On the one hand, existential risks (x-risks) are necessarily somewhat speculative: by the time there is concrete evidence, it may be too late. On the other hand, governments must prioritize &#8212; after all, they don&#8217;t worry too much about x-risk from alien invasions.</p><p>This is the first in a series of essays laying out an evidence-based approach for policymakers concerned about AI x-risk, an approach that stays grounded in reality while acknowledging that there are &#8220;unknown unknowns&#8221;.</p><p>In this first essay, we look at one type of evidence: probability estimates. The AI safety community relies heavily on forecasting the probability of human extinction due to AI (in a given timeframe) in order to inform decision making and policy. An estimate of 10% over a few decades, for example, would obviously be high enough for the issue to be a top priority for society.</p><p>Our central claim is that AI x-risk forecasts are far too unreliable to be useful for policy, and in fact highly misleading.</p><h3><strong>Look behind the curtain</strong></h3><p>If the two of us predicted an 80% probability of aliens landing on Earth in the next ten years, would you take this possibility seriously? Of course not. You would ask to see our evidence. Obvious as this may sound, it seems to have been forgotten in the AI x-risk debate that probabilities carry no authority by themselves. Probabilities are <em>usually</em> derived from some grounded method, so we have a strong cognitive bias to view quantified risk estimates as more valid than qualitative ones. But it is possible for probabilities to be nothing more than guesses.
Keep this in mind throughout this essay (and more broadly in the AI x-risk debate).</p><p>If we predict odds for the Kentucky Derby, we don&#8217;t have to give you a reason &#8212; you can take it or leave it. But if a policymaker takes actions based on probabilities put forth by a forecaster, they had better be able to explain those probabilities to the public (and that explanation must in turn come from the forecaster). <a href="https://plato.stanford.edu/entries/justification-public/">Justification</a> is essential to the legitimacy of government and the exercise of power. A core principle of liberal democracy is that the state should not limit people's freedom based on controversial beliefs that reasonable people can reject.</p><p>Explanation is especially important when the policies being considered are costly, and even more so when those costs are unevenly distributed among stakeholders. A good example is restricting open releases of AI models. Can governments convince people and companies who stand to benefit from open models that they should make this sacrifice because of a speculative future risk?</p><p>The main aim of this essay is to analyze whether there is any justification for any of the specific x-risk probability estimates that have been cited in the policy debate. We have no objection to AI x-risk forecasting as an academic activity, and forecasts may be helpful to companies and other private decision makers. We only question its use in the context of public policy.</p><p>There are basically only three known ways by which a forecaster can try to convince a skeptic: inductive, deductive, and subjective probability estimation. We consider each of these in the following sections. All three require both parties to agree on some basic assumptions about the world (which cannot themselves be proven). The three approaches differ in terms of the empirical and logical ways in which the probability estimate follows from that set of assumptions.</p><h3><strong>Inductive probability estimation is unreliable due to the lack of a reference class</strong></h3><p>Most risk estimates are inductive: they are based on past observations. For example, insurers base their predictions of an individual&#8217;s car accident risk on data from past accidents about similar drivers. The set of observations used for probability estimation is called a reference class. A suitable reference class for car insurance might be the set of drivers who live in the same city. If the analyst has more information about the individual, such as their age or the type of car they drive, the reference class can be further refined.</p><p>For existential risk from AI, there is no reference class, as it is an event like no other. To be clear, this is a matter of degree, not kind. There is never a clear &#8220;correct&#8221; reference class to use, and the choice of a reference class in practice comes down to the analyst&#8217;s intuition.</p><p>The accuracy of the forecasts depends on the degree of similarity between the process that generates the event being forecast and the process that generated the events in the reference class, which can be seen as a spectrum. For predicting the outcome of a physical system such as a coin toss, past experience is a highly reliable guide.
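</p><p>As a minimal illustration of what inductive estimation involves (a toy sketch with made-up numbers, not real actuarial data): the estimate is just a base rate within a chosen reference class, and refining the class changes the number. That sensitivity is the degree of freedom that becomes fatal when no reasonable class exists at all:</p><pre><code># Toy illustration of inductive risk estimation (hypothetical data).
# The "probability" is a base rate within a chosen reference class,
# and refining the class changes the estimate.

records = [
    # (city, age_band, had_accident) -- made-up driver records
    ("princeton", "18-25", True), ("princeton", "18-25", False),
    ("princeton", "26-65", True), ("princeton", "26-65", False),
    ("princeton", "26-65", False), ("trenton", "18-25", True),
    ("trenton", "26-65", False), ("trenton", "26-65", False),
]

def base_rate(rows):
    """Fraction of drivers in this reference class who had an accident."""
    return sum(1 for r in rows if r[2]) / len(rows)

everyone = records
same_city = [r for r in records if r[0] == "princeton"]
same_city_and_age = [r for r in same_city if r[1] == "18-25"]

print(base_rate(everyone))           # 0.375
print(base_rate(same_city))          # 0.4
print(base_rate(same_city_and_age))  # 0.5
</code></pre><p>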
Next, for car accidents, risk estimates might vary by, say, 20% based on the past dataset used &#8212; good enough for insurance companies.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!drZV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3c0d76-b6af-4713-8bff-5c0f144c4f04_1238x334.png" alt=""></figure></div><p>Further along the spectrum are geopolitical events, where the choice of reference class gets even fuzzier. Forecasting expert Philip Tetlock <a href="https://www.jstor.org/stable/j.ctt1pk86s8.5">explains</a>: &#8220;Grexit may have looked sui generis, because no country had exited the Eurozone as of 2015, but it could also be viewed as just another instance of a broad comparison class, such as negotiation failures, or of a narrower class, such as nation-states withdrawing from international agreements or, narrower still, of forced currency conversions.&#8221; He goes on to defend the idea that even seeming <a href="https://en.wikipedia.org/wiki/Black_swan_theory">Black Swan events</a> like the collapse of the USSR or the Arab Spring can be modeled as members of reference classes, and that inductive reasoning is useful even for this kind of event.</p><p>In Tetlock&#8217;s spectrum, these events represent the &#8220;peak&#8221; of uniqueness. When it comes to geopolitical events, that might be true. But even those events are far less unique than extinction from AI. Just look at the <a href="https://www.openphilanthropy.org/wp-content/uploads/2023.05.22-AI-Reference-Classes-Zachary-Freitas-Groff.pdf">attempts</a> to find reference classes for AI x-risk: animal extinction (as a reference class for human extinction), past global transformations such as the industrial revolution (as a reference class for socioeconomic transformation from AI), or accidents causing mass deaths (as a reference class for accidents causing global catastrophe). Let&#8217;s get real. None of those tell us anything about the possibility of developing superintelligent AI or losing control over such AI, which are the central sources of uncertainty for AI x-risk forecasting.</p><p>To summarize, human extinction due to AI is an outcome so far removed from anything that has happened in the past that we cannot use inductive methods to &#8220;predict&#8221; the odds.
Of course, we can get qualitative insights from past technical breakthroughs as well as past catastrophic events, but AI risk is sufficiently different that quantitative estimates lack the kind of justification needed for legitimacy in policymaking.</p><h3><strong>Deductive probability estimation is unreliable due to the lack of theory</strong></h3><p>In Conan Doyle&#8217;s <em>The Adventure of the Six Napoleons</em> <em>&#8212;</em> spoiler alert! <em>&#8212; </em>Sherlock Holmes announces before embarking on a stakeout that the probability of catching the suspect is exactly two-thirds. This seems bewildering &#8212; how can anything related to human behavior be ascribed a mathematically precise probability?</p><p>It turns out that Holmes has deduced the underlying series of events that gave rise to the suspect&#8217;s seemingly erratic observed behavior: the suspect is methodically searching for a jewel that is known to be hidden inside one of six busts of Napoleon owned by different people in and around London. The details aren&#8217;t too important, but the key is that neither the suspect nor the detectives know <em>which</em> of the six busts it is in, and everything else about the suspect&#8217;s behavior is (assumed to be) entirely predictable. Hence the precisely quantifiable uncertainty.</p><p>The point is that if we have a model of the world that we can rely upon, we can estimate risk through logical deduction, even without relying on past observations. Of course, outside of fictional scenarios, the world isn&#8217;t so neat, especially when we want to project far into the future.</p><p>When it comes to x-risk, there is an interesting exception to the general rule that we don&#8217;t have deductive models &#8212; asteroid impact. A combination of inductive and deductive risk estimation does allow us to estimate the probability of x-risk, only because we&#8217;re talking about a purely physical system. Let&#8217;s take a minute to review how this works, because it&#8217;s important to recognize that the methods are not generalizable to other types of x-risk.&nbsp;</p><p>The key is being able to <a href="https://www.nature.com/articles/367033a0">model</a> the relationship between the size of the asteroid (more precisely, the energy of impact) and the frequency of impact. Since we have observed thousands of small impacts, we can extrapolate to infer the frequency of large impacts that have never been directly observed. 
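</p><p>The shape of this extrapolation can be sketched in a few lines (illustrative parameters of our own; the published hazard models are more careful than this). Fit a power law to the frequency of small, routinely observed impacts, then read off the implied rate of impacts large enough to threaten civilization:</p><pre><code># Stylized sketch of asteroid-risk extrapolation (our toy numbers,
# not the published model): suppose impacts with energy of at least
# E megatons are observed at a rate of roughly a * E**(-b) per year,
# with a and b fit to the plentiful data on small impacts.
a, b = 1.0, 0.9   # hypothetical fit parameters

def annual_rate(energy_mt):
    """Extrapolated events per year with at least this impact energy."""
    return a * energy_mt ** (-b)

# Extrapolate far beyond anything observed, to a hypothetical
# 100-million-megaton, civilization-threatening impact.
rate = annual_rate(1e8)
print(f"~{rate:.1e} events/year, about one per {1 / rate:,.0f} years")
</code></pre><p>The same move is not available for AI x-risk: there is no observed distribution of &#8220;small AI catastrophes&#8221; whose tail we could read off.</p><p>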
We can also estimate the threshold that would cause global catastrophe.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!77m3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3483cbd6-7b7c-4fb8-833c-66c4c1f4591f_2396x920.png" alt=""><figcaption class="image-caption">Figure: data on small asteroid impacts (illustrated on the <a href="https://commons.wikimedia.org/wiki/File:Bolide_events_1994-2013.jpg">left</a>) can be extrapolated to extinction-level impacts (right).</figcaption></figure></div><p>With AI, the unknowns relate to <a href="https://www.aisnakeoil.com/p/ai-scaling-myths">technological progress</a> and governance rather than a physical system, so it isn&#8217;t clear how to model it mathematically. Still, people have tried. For example, in order to predict the computational requirements of a hypothetical AGI, <a href="https://arxiv.org/pdf/2306.02519">several</a> <a href="https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines">works</a> assume that an AI system would require roughly as many computations as the human brain, and further make assumptions about the number of computations required by the human brain. These assumptions are far more tenuous than those involved in asteroid modeling, and none of this even addresses the loss-of-control question.</p><h3><strong>Subjective probabilities are feelings dressed up as numbers</strong></h3><p>Without reference classes or grounded theories, forecasts are necessarily &#8220;subjective probabilities&#8221;, that is, guesses based on the forecaster&#8217;s judgment. Unsurprisingly, these vary by orders of magnitude.</p><p>Subjective probability estimation does not get around the need for having either an inductive or a deductive basis for probability estimates. It merely avoids the need for the forecaster to <em>explain</em> their estimate. Explanation can be hard because of our limited ability to articulate our own intuitive reasoning, whether inductive, deductive, or a combination thereof. Essentially, it allows the forecaster to say: &#8220;even though I haven&#8217;t shown my methods, you can trust this estimate because of my track record&#8221; (we explain in the next section why even this breaks down for AI x-risk forecasting).
But ultimately, lacking either an inductive or a deductive basis, all that forecasters can do is to make up a number, and those made-up numbers are all over the place.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!9nvB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ce8a04c-c569-4da7-9c8d-3d4b01ca20de_1272x896.png" alt=""></figure></div><p>Consider the <a href="https://static1.squarespace.com/static/635693acf15a3e2a14a56a4a/t/64f0a7838ccbf43b6b5ee40c/1693493128111/XPT.pdf">Existential Risk Persuasion Tournament</a> (XPT) conducted by the Forecasting Research Institute in late 2022, which we think is the most elaborate and well-executed x-risk forecasting exercise conducted to date. It involved various groups of forecasters, including AI experts and forecasting experts (&#8220;superforecasters&#8221; in the figure). For AI experts, the high end (75th percentile) of estimates for AI extinction risk by 2100 is 12%, the median estimate is 3%, and the low end (25th percentile) is 0.25%. For forecasting experts, even the high end (75th percentile) is only 1%, the median is a mere 0.38%, and the low end (25th percentile) is visually indistinguishable from zero on the graph. In other words, the 75th percentile AI expert forecast and the 25th percentile superforecaster forecast differ by at least a factor of 100.</p><p>All of these estimates are from people who have deep expertise on the topic and participated in a months-long tournament where they tried to persuade each other! If this range of forecasts isn&#8217;t extreme enough, keep in mind that this whole exercise was conducted by one group at one point in time. We might get different numbers if the tournament were repeated today, if the questions were framed differently, etc.</p><p>What&#8217;s most telling is to look at the rationales that forecasters provided, which are extensively detailed in the report. They aren&#8217;t using quantitative models, especially when thinking about the likelihood of bad outcomes conditional on developing powerful AI. For the most part, forecasters are engaging in the same kind of speculation that everyday people do when they discuss superintelligent AI. Maybe AI will take over critical systems through superhuman persuasion of system operators. Maybe AI will seek to lower global temperatures because it helps computers run faster, and accidentally wipe out humanity. Or maybe AI will seek resources in space rather than Earth, so we don&#8217;t need to be as worried. There&#8217;s nothing wrong with such speculation.
But we should be clear that when it comes to AI x-risk, forecasters aren&#8217;t drawing on any special knowledge, evidence, or models that make their hunches more credible than yours or ours or anyone else&#8217;s.</p><p>The term <a href="https://www.amazon.com/Superforecasting-Science-Prediction-Philip-Tetlock/dp/0804136718">superforecasting</a> comes from Philip Tetlock&#8217;s 20-year study of forecasting (he was also one of the organizers of the XPT). Superforecasters tend to be trained in methods to improve forecasts, such as integrating diverse information and minimizing psychological biases. These methods have been shown to be effective in domains such as geopolitics. But no amount of training will lead to good forecasts if there isn&#8217;t much useful evidence to draw from.</p><p>Even if forecasters had credible quantitative models (they don&#8217;t), they must account for &#8220;unknown unknowns&#8221;, that is, the possibility that the model itself might be wrong. As noted x-risk philosopher Nick Bostrom <a href="https://existential-risk.com/concept.pdf">explains</a>: &#8220;The uncertainty and error-proneness of our first-order assessments of risk is itself something we must factor into our all-things-considered probability assignments. This factor often dominates in low-probability, high-consequence risks &#8212; especially those involving poorly understood natural phenomena, complex social dynamics, or new technology, or that are difficult to assess for other reasons.&#8221;</p><p>This is a reasonable perspective, and AI x-risk forecasters do worry a lot about uncertainty in risk assessment. But one consequence of this is that for those who follow this principle, forecasts are <em>guaranteed</em> to be guesses rather than the output of a model &#8212; after all, no model can be used to estimate the probability that the model itself is wrong, or what the risk would be if the model were wrong.</p><h3><strong>Forecast skill cannot be measured when it comes to unique or rare events</strong></h3><p>To recap, subjective AI-risk forecasts vary by orders of magnitude. But if we can measure forecasters&#8217; track records, maybe we can use that to figure out which forecasters to trust. In contrast to the previous two approaches for justifying risk estimates (inductive and deductive), the forecaster doesn&#8217;t have to explain their estimate, but instead justifies it based on their demonstrated skill at predicting <em>other</em> outcomes in the past.</p><p>This has proved to be invaluable in the domain of geopolitical events, and the forecasting community spends a lot of effort on skill measurement. Many ways to evaluate forecasting skill exist, such as calibration, the Brier score, the logarithmic score, or the Peer score used on the forecasting competition website <a href="https://www.metaculus.com/">Metaculus</a>.</p><p>But regardless of which method is used, when it comes to existential risk, there are many barriers to assessing forecast skill for subjective probabilities: the lack of a reference class, the low base rate, and the long time horizon. Let&#8217;s look at each of these in turn.</p><p>Just as the reference class problem plagues the forecaster, it also affects the evaluator. Let&#8217;s return to the alien landing example. Consider a forecaster who has proved highly accurate at calling elections. Suppose this forecaster announces, without any evidence, that aliens will land on Earth within a year.
Despite the forecaster&#8217;s demonstrated skill, this would not cause us to update our beliefs about an alien landing, because it is too dissimilar to election forecasting and we do not expect the forecaster&#8217;s skill to generalize. Similarly, AI x-risk is so dissimilar to any past events that have been forecast that there is no evidence of any forecaster&#8217;s skill at estimating AI x-risk.</p><p>Even if we somehow do away with the reference class problem, other problems remain &#8212; notably, the fact that extinction risks are &#8220;tail risks&#8221;, or risks that result from rare events. Suppose forecaster A says the probability of AI x-risk is 1%, and forecaster B says it is 1 in a million. Which forecast should we have more confidence in? We could look at their track records. Say we find that forecaster A (who has assigned a 1% probability to AI x-risk) has a better track record. It still doesn&#8217;t mean we should have more confidence in A&#8217;s forecast, because <em>skill evaluations are insensitive to overestimation of tail risks</em>. In other words, it could be that A scores higher overall because A is slightly better calibrated than B when it comes to everyday events that have a substantial probability of occurring, but tends to massively overestimate tail risks that occur rarely (for example, those with a probability of 1 in a million) by orders of magnitude. No scoring rule adequately penalizes this type of miscalibration.</p><p>Here&#8217;s a thought experiment to show why this is true. Suppose two forecasters F and G forecast two different sets of events, and the &#8220;true&#8221; probabilities of events in both sets are uniformly distributed between 0 and 1. We assume, highly optimistically, that both F and G know the true probability P[<em>e</em>] for every event <em>e</em> that they forecast. F always outputs P[<em>e</em>], but G is slightly conservative, never predicting a value less than 1%. That is, G outputs P[<em>e</em>] if P[<em>e</em>] &gt;= 1%, otherwise outputs 1%.</p><p>By construction, F is the better forecaster. But would this be evident from their track records? In other words, how many forecasts from each would we have to evaluate so that there&#8217;s a 95% chance that F outscores G? With the logarithmic scoring rule, it turns out to be on the order of a hundred million. With the Brier score, it is on the order of a trillion.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> We can quibble with the assumptions here, but the point is that if a forecaster systematically overestimates tail risks, it is simply empirically undetectable (we sketch a simulation of this below).</p><p>The final barrier to assessing forecaster skill at predicting x-risk is that long-term forecasts take too long to evaluate (and extinction forecasts are of course impossible to evaluate). This can potentially be overcome. Researchers have developed a method called <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3954498">reciprocal scoring</a> &#8212; where forecasters are rewarded based on how well they predict each other&#8217;s forecasts &#8212; and validated it in some real-world settings, such as predicting the effect of Covid-19 policies. In these settings, reciprocal scoring yielded forecasts that are as good as those from traditional scoring methods. Fair enough.
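</p><p>Returning to the F-versus-G thought experiment: it can be checked with a short Monte Carlo sketch (our illustration; the hundred-million and trillion figures above come from our own analysis, not from this snippet). Under the logarithmic score, F&#8217;s expected per-forecast advantage turns out to be orders of magnitude smaller than the per-forecast noise, which is why astronomically many resolved forecasts are needed to tell the two apart:</p><pre><code>import numpy as np

# Monte Carlo sketch of the F-vs-G thought experiment.
rng = np.random.default_rng(0)
n = 2_000_000

true_p = rng.uniform(1e-9, 1.0, size=n)    # true event probabilities
                                           # (small floor avoids log(0))
occurred = rng.uniform(size=n) &lt; true_p    # simulated outcomes

f = true_p                     # F reports the true probability
g = np.maximum(true_p, 0.01)   # G never goes below 1%

def log_score(p, y):
    """Logarithmic score (higher is better)."""
    return np.where(y, np.log(p), np.log(1.0 - p))

gap = log_score(f, occurred) - log_score(g, occurred)
print(gap.mean())   # F's edge: tiny and positive, on the order of 1e-5
print(gap.std())    # noise: roughly 5e-3, hundreds of times the edge
</code></pre><p>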
But reciprocal scoring is not a way around the reference class problem or the tail risk problem.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!PQUD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4779dde8-30c9-4883-b698-78ffd90e3a53_2150x498.png" alt=""><figcaption class="image-caption"><em>Summary of our argument so far, showing why none of the three forecasting methods can yield credible estimates of AI x-risk.</em></figcaption></figure></div><h3><strong>There are many reasons why risk
<p>To recap, inductive and deductive methods don&#8217;t work, subjective forecasts are all over the place, and there&#8217;s no way to tell which forecasts are more trustworthy.</p><p>So in an attempt to derive more reliable estimates that could potentially inform policy, some researchers have turned to forecast aggregation methods that combine the predictions of multiple forecasters. A notable effort is the <a href="https://wiki.aiimpacts.org/doku.php?id=ai_timelines:predictions_of_human-level_ai_timelines:ai_timeline_surveys:2022_expert_survey_on_progress_in_ai">AI Impacts Survey on Progress in AI</a>, but it has been criticized for serious methodological limitations including <a href="https://aiguide.substack.com/p/do-half-of-ai-researchers-believe">non</a>-<a href="https://www.scientificamerican.com/article/ai-survey-exaggerates-apocalyptic-risks/">response</a> <a href="https://spectrum.ieee.org/ai-existential-risk-survey">bias</a>. More importantly, it is unclear why aggregation should improve forecast accuracy: after all, most forecasters might share the same biases (and again, none of them have any basis for a reliable forecast).</p><p>There are many reasons why forecasters might systematically overestimate AI x-risk.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> The first is selection bias. Take AI researchers: the belief that AI can change the world is one of the main motivations for becoming an AI researcher. Once someone enters this community, they are in an environment where that message is constantly reinforced. And if one believes that this technology is terrifyingly powerful, it is perfectly rational to think there is a serious chance that its world-altering effects will be negative rather than positive.</p><p>And in the AI safety subcommunity, which is a bit insular, the <a href="https://www.washingtonpost.com/technology/2023/07/05/ai-apocalypse-college-students/">echo chamber</a> can be deafening. Claiming to have a high <em>p(doom)</em> (one&#8217;s estimate of the probability of AI doom) seems to have become a way to signal one&#8217;s identity and commitment to the cause.</p><p>There is a slightly different selection bias at play when it comes to forecasting experts. The forecasting community has a strong overlap with effective altruism and concerns about existential risk, especially AI risk. This doesn&#8217;t mean that individual forecasters are biased. But having a high <em>p(doom)</em> might make someone more inclined to take up forecasting as an activity. So the community as a whole is likely biased toward people with x-risk worries.&nbsp;</p><p>Forecasters are good at updating their beliefs in response to evidence. The problem is that, unlike with, say, asteroid impact risk, there is little evidence that can change one&#8217;s beliefs one way or another when it comes to AI x-risk. So we suspect that forecasts are strongly influenced by the priors with which people enter the community. 
The <a href="https://static1.squarespace.com/static/635693acf15a3e2a14a56a4a/t/64f0a7838ccbf43b6b5ee40c/1693493128111/XPT.pdf">XPT report</a> notes that &#8220;Few minds were changed during the XPT, even among the most active participants, and despite monetary incentives for persuading others.&#8221; In a <a href="https://static1.squarespace.com/static/635693acf15a3e2a14a56a4a/t/65ef1ee52e64b52f145ebb49/1710169832137/AIcollaboration.pdf">follow-up study</a>, they found that many of the disagreements were due to fundamental worldview differences that go beyond AI.</p><p>To reemphasize, our points about bias are specific to AI x-risk. If there were a community of election forecasters who were systematically biased (say, toward incumbents), this would become obvious after a few elections when comparing predictions with reality. But with AI x-risk, as we showed in the previous section, skill evaluation is insensitive to overestimation of tail risks.</p><p>Interestingly, skill evaluation is extremely sensitive to <em>underestimation</em> of tail risks: if you assign a probability of 0 for a rare event that actually ends up occurring, you incur an <em>infinite </em>penalty under the logarithmic scoring rule, from which you can never recover regardless of how well you predicted other events. This is considered one of the main benefits of the logarithmic score and is the reason it is adopted by Metaculus.</p><p>Now consider a forecaster who doesn&#8217;t have a precise estimate &#8212; and surely no forecaster has a precise estimate for something with so many axes of uncertainty as AI x-risk. Given the asymmetric penalties, the rational thing to do is to go with the higher end of their range of estimates.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>In any case, it&#8217;s not clear what forecasters actually report when their estimates are highly uncertain. Maybe they don&#8217;t respond to the incentives of the scoring function. After all, long-term forecasts won&#8217;t be resolved anytime soon. And recall that in the case of the XPT, the incentive is actually to predict each others&#8217; forecasts to get around the problem of long time horizons. The reciprocal scoring paper argues that this will incentivize forecasters to submit their true, high-effort estimates, and considers various objections to this claim. Their defense of the method rests on two key assumptions: that by exerting more effort forecasters can get closer to the true estimate, and that they have no better way to predict what other forecasters will do.</p><p>What if these assumptions are not satisfied? As we have argued throughout this post, with AI x-risk, we shouldn&#8217;t expect evidence to change forecasters&#8217; prior beliefs, so the first assumption is dubious. And now that one iteration of the XPT has concluded, the published median estimates from that tournament serve as a powerful anchor (a &#8220;<a href="https://en.wikipedia.org/wiki/Focal_point_(game_theory)">focal point</a>&#8221; in game theory). It is possible that in the future, forecasters with reciprocal scoring incentives will use existing median forecasts as a starting point, only making minor adjustments to account for new information that has become available since the last tournament. The range of estimates might narrow as existing estimates serve as anchors for future estimates. 
<h3><strong>Beware Pascal&#8217;s wager: the dangers of utility maximization</strong></h3><p>For what it&#8217;s worth, here are the median estimates from the <a href="https://static1.squarespace.com/static/635693acf15a3e2a14a56a4a/t/64f0a7838ccbf43b6b5ee40c/1693493128111/XPT.pdf">XPT</a> of both extinction risk and sub-extinction catastrophic risks from AI:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!9IN3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8655d15c-6f40-43a0-bf3e-1ca1aadb9cd0_1892x614.png" alt="Table of XPT median estimates of extinction and catastrophic risks from AI"></figure></div>
<p>To reiterate, our view is that we shouldn&#8217;t take any of these numbers too seriously. They are more a reflection of how much different samples of participants fret about AI than of anything else.</p><p>As before, the estimates from forecasting experts (superforecasters) and AI experts differ by an order of magnitude or more. To the extent that we put any stock into these estimates, it should be the forecasting experts&#8217; rather than the AI experts&#8217; estimates. One important insight from past research is that domain experts perform worse than forecasting experts, who have training in integrating diverse information and in minimizing psychological biases. Still, as we said above, even their forecasts may be vast overestimates, and we just can&#8217;t know for sure.</p><p>So what&#8217;s the big deal? So what if policymakers believe the risk over a certain timeframe is 1% instead of 0.01%? It seems pretty low in either case!</p><p>It depends on what they do with those probabilities. Most often, these estimates are merely a way to signal the fact that some group of experts thinks the risk is significant. If that&#8217;s all they are, so be it. But it&#8217;s not clear that all this elaborate effort at quantification is even helpful for this signaling purpose, given that different people interpret the same numbers wildly differently.&nbsp;</p><p>For example, Federal Trade Commission chair Lina Khan said her views on the matter were techno-optimistic since her <em>p(doom)</em> was <a href="https://www.nytimes.com/2023/11/10/podcasts/hardfork-chatbot-ftc.html">only 15%</a>, which left experts bewildered. (For what it&#8217;s worth, that number is about a thousandfold higher than what we would be comfortable labeling techno-optimist.) 
It takes a lot of quantitative training to be able to mentally process very small or very large numbers correctly in decision making, and not simply bucket them into categories like &#8220;insignificantly small&#8221;. Most people are not trained this way.&nbsp;</p><p>In short, what seems to be happening is that experts&#8217; vague intuitions and fears are being translated into pseudo-precise numbers, and then translated back into vague intuitions and fears by policymakers. Let&#8217;s just cut the charade of quantification! The Center for AI Safety&#8217;s <a href="https://www.safe.ai/work/statement-on-ai-risk">Statement on AI Risk</a> was admirably blunt in this regard (of course, we <a href="https://www.aisnakeoil.com/p/is-avoiding-extinction-from-ai-really">strongly disagree with its substance</a>).</p><p>A principled, quantitative way to use probabilities in decision making is utility maximization through cost-benefit analysis. The idea is simple: if we consider an outcome to have a subjective value, or utility, of U (which can be positive or negative), and it has, say, a 10% probability of occurring, we can act as if it is certain to occur and has a value of 0.1 * U. We can then add up the costs and benefits for each option available to us, and choose the one that maximizes benefits minus costs (the &#8220;expected utility&#8221;).</p><p>This is where things get really problematic. First, some people might consider extinction to have an unfathomably large negative value, because it precludes the existence of all the human lives, physical or <a href="https://www.vox.com/future-perfect/23298870/effective-altruism-longtermism-will-macaskill-future">simulated</a>, that might ever be born in the future. The logical conclusion is that x-risk should be everyone&#8217;s top priority all the time! It is reminiscent of <a href="https://en.wikipedia.org/wiki/Pascal%27s_wager">Pascal&#8217;s wager</a>, the argument that it is rational to believe in God because even if there is an infinitesimally small chance that God exists, the cost of non-belief is infinite (an eternity in hell as opposed to eternal happiness), and hence so is the expected utility. Fortunately, policymakers don&#8217;t give too much credence to decision making frameworks involving infinities. But the idea has taken a powerful hold of the AI safety community and drives some people&#8217;s conviction that AI x-risk should be society&#8217;s top priority.</p><p>Even if we limit ourselves to catastrophic but not existential risks, we are talking about billions of lives on the line, so the expected cost of even a 1% risk is so high that the policy implications are drastic &#8212; governments should increase spending on AI x-risk mitigation by orders of magnitude and consider draconian measures such as stopping AI development. This is why it is so vital to understand that these estimates are not backed by any methodology. It would be incredibly unwise to make world-changing policy decisions based on so little evidence.</p>
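<p>A minimal sketch of how this arithmetic plays out (every number below is made up purely for illustration):</p><pre><code># Expected-utility arithmetic with an astronomically negative utility
# assigned to extinction (all numbers are made up for illustration).
def expected_utility(prob, utility):
    # Treat an uncertain outcome as certain, scaled by its probability.
    return prob * utility

sure_benefit = expected_utility(1.0, 1e9)     # a large, certain benefit
pascal_term = expected_utility(1e-9, -1e40)   # vanishingly unlikely, "unfathomable" loss

# The near-infinite negative utility dominates no matter how small its
# probability is, so it swamps every other consideration:
print(sure_benefit + pascal_term)  # -1e+31, to within rounding</code></pre>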
<h3><strong>Forecasts of milestones suffer from outcome ambiguity</strong></h3><p>Is there a role for forecasting in AI policy? We think yes &#8212; just not forecasting existential risk. Forecasting AI milestones, such as performance on certain capability benchmarks or economic impacts, is more achievable and meaningful. If a forecaster has demonstrated skill in predicting when various AI milestones would be reached, it does give us evidence that they will do well in the future. We are no longer talking about unique or rare events. And when considering lower-stakes policy interventions &#8212; preparing for potential economic disruption rather than staving off killer robots &#8212; it is less critical that forecasts be justified to the satisfaction of every reasonable person.&nbsp;</p><p>The forecasting community devotes a lot of energy to milestone forecasting. On Metaculus, the question &#8220;Will there be Human-machine intelligence parity before 2040?&#8221; has an aggregate prediction of <a href="https://www.metaculus.com/questions/384/humanmachine-intelligence-parity-by-2040/">96%</a> based on over 1,300 forecasters. That&#8217;s remarkable! If we agreed with this forecast, we would be in favor of the position that managing the safe transition to AGI should be a global priority. Why don&#8217;t we?</p><p>The answer is in the fine print. There is no consensus on the definition of a fuzzy concept such as AGI. Even if we fix a definition, determining whether it has been achieved can be hard or impossible. For effective forecasting, it is extremely important to avoid ambiguous outcomes. The way the forecasting community gets around this is by defining it in terms of relatively narrow skills, such as exam performance.&nbsp;</p><p>The Metaculus intelligence parity question is defined in terms of the performance on graduate exams in math, physics, and computer science. Based on this definition, we do agree with the forecast of 96%. But we think the definition is so watered down that it doesn&#8217;t mean much for policy. Forget existential risk &#8212; as we&#8217;ve written before, AI performance on exams has so little <a href="https://www.aisnakeoil.com/p/gpt-4-and-professional-benchmarks">construct validity</a> that it doesn&#8217;t even let us predict whether AI will replace workers.</p><p>Other benchmarks <a href="https://www.aisnakeoil.com/p/new-paper-ai-agents-that-matter">aren&#8217;t much better</a>. In short, forecasting AI capability timelines is tricky because of the huge gap between benchmarks and real-world implications. Fortunately, <a href="https://www.openphilanthropy.org/rfp-llm-benchmarks/">better benchmarks</a> reflecting consequential real-world tasks are being developed. In addition to benchmarks, we need naturalistic evaluation, even if it is more costly. One type of naturalistic evaluation is to measure how people perform their jobs differently with AI assistance. 
Directly forecasting economic, social, or political impacts &#8212; such as labor market transformation or AI-related spending by militaries &#8212; could be even more useful, although harder to unambiguously define and measure.</p><h3><strong>Concluding thoughts</strong></h3><p>The responsibility for avoiding misuses of probability in policy lies with policymakers. We are not calling for forecasters to stop publishing forecasts in order to &#8220;protect&#8221; policymakers from being misled. That said, we think forecasts should be accompanied by a clear explanation of the process used and the evidence considered. This would allow policymakers to make informed decisions about whether the justification presented meets the threshold that they are comfortable with. The XPT is a good example of transparency, as is <a href="https://arxiv.org/abs/2306.02519">this paper</a> (though it is not about x-risk). On the other hand, simply surveying a bunch of researchers and presenting aggregate numbers is misinformative and should be ignored by policymakers.</p><p>So what should governments do about AI x-risk? Our view isn&#8217;t that they should do nothing. But they should reject the kind of policies that might seem compelling if we view x-risk as urgent and serious &#8212; notably, restricting AI development. As we&#8217;ll argue in a future essay in this series, not only are such policies unnecessary, they are likely to <em>increase</em> x-risk. Instead, governments should adopt policies that are compatible with a range of possible estimates of AI risk, and are on balance helpful even if the risk is negligible. Fortunately, such policies exist. Governments should also change policymaking <em>processes</em> so that they are more responsive to new evidence. More on all that soon.</p><h4><strong>Further reading</strong></h4><ul><li><p>The XPT report is titled <a href="https://static1.squarespace.com/static/635693acf15a3e2a14a56a4a/t/64f0a7838ccbf43b6b5ee40c/1693493128111/XPT.pdf">Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament</a>.</p></li><li><p>Tetlock and Gardner&#8217;s book <a href="https://www.amazon.com/Superforecasting-Science-Prediction-Philip-Tetlock/dp/0804136718">Superforecasting</a> summarizes research by Tetlock, Barbara Mellers, and others.</p></li><li><p>Scott Alexander refutes the claim that there is something wrong in principle with ascribing <a href="https://www.astralcodexten.com/p/in-continued-defense-of-non-frequentist">probabilities to unique events</a> (we largely agree). Our argument differs in two key ways from the position he addresses. We aren&#8217;t talking about forecasting in general, just its application to policymaking. And we don&#8217;t object to it in principle. Our argument is empirical: AI x-risk forecasts are extremely unreliable and lack justification. Theoretically this could change in the future, though we aren&#8217;t holding our breath.</p></li><li><p>A <a href="https://www.jstor.org/stable/45094450?seq=1">paper</a> by Friedman and Zeckhauser explains why probabilities aren&#8217;t the whole story: two forecasts that have the same risk estimate might have very different implications for policymakers. 
In our view, AI x-risk forecasts fare poorly on two of the three dimensions of confidence: a sound basis in evidence and a narrow range of reasonable opinion.</p></li><li><p>A paper by our Princeton CITP colleagues led by Shazeda Ahmed explains the <a href="https://firstmonday.org/ojs/index.php/fm/article/view/13626/11596">epistemic culture of AI safety</a>, which consists of &#8220;cohesive, interwoven social structures of knowledge-production and community-building&#8221;. It helps explain why practices such as forecasting have become pillars of how the AI safety community forms its beliefs, in contrast to the broader scientific community, which centers practices such as peer review. Of course, we shouldn&#8217;t reject a view just because it doesn&#8217;t conform to scientific orthodoxy. But at the same time, we shouldn&#8217;t give any deference to the self-styled AI safety community&#8217;s views on AI safety. It is important to understand that the median member of the AI safety community holds one particular stance on AI safety &#8212; a stance that is highly contested and in our view rather alarmist.</p></li><li><p>We have written extensively about <a href="https://www.aisnakeoil.com/t/ai-safety">evidence-based AI safety</a>. Our best-known work includes the essay <a href="https://www.aisnakeoil.com/p/ai-safety-is-not-a-model-property">AI safety is not a model property</a> and the paper titled <a href="https://crfm.stanford.edu/open-fms/">On the Societal Impact of Open Foundation Models</a>, which was the result of a large collaboration.</p></li></ul><p><strong>Acknowledgements.</strong> We are grateful to Benjamin Edelman, Ezra Karger, Matt Salganik, and Ollie Stephenson for feedback on a draft. This series of essays is based on an upcoming paper that benefited from feedback from many people, including Seth Lazar and members of the MINT lab at Australian National University, students in the Limits to Prediction course at Princeton, Shazeda Ahmed, and Zachary Siegel.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Specifically, by using data from nuclear weapons tests and making a few physics-informed assumptions, we can calculate what it would take to kick up enough dust to darken the skies for a prolonged period and lead to a collapse of global agriculture.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Suppose G, the conservative forecaster, has a floor of &#1013; for their forecasts (in our example, 0.01). There is only an &#1013; fraction of events where the difference between the two forecasters is even relevant. Even limiting ourselves to the set of events with a true probability less than &#1013;, some calculation shows that F&#8217;s expected log score is better than G&#8217;s by a tiny amount &#8212; O(&#1013;), and with the Brier score, O(&#1013;^2). So when looking at all events, the differences in expected scores are O(&#1013;^2) and O(&#1013;^3) respectively. Meanwhile, the variance in their scores comes out to about 0.1 for both scoring functions. 
To be able to confidently assert that the means of the two forecasters&#8217; scores are different, we need the number of events N to be large enough that the standard deviation of the difference in mean scores, which is sqrt(0.1 / N), is much less than the expected difference. So we need N ~ O(1/&#1013;^4) for the log score and N ~ O(1/&#1013;^6) for the Brier score.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Of course, there are also reasons why forecasters might underestimate the risk. We find those reasons less persuasive, but more importantly, we consider them out of scope for this discussion. The reason is simple: the advocacy for governments to implement freedom-restricting policies is being justified by <em>high</em> x-risk estimates. The burden of evidence is thus asymmetric. Those who call for such policies must justify their probability estimates, including responding to concerns about upward biases. If, in some strange future world, some people advocate for restrictive policies based on <em>low</em> x-risk estimates, it will become important to subject potential <em>underestimation</em> to the same scrutiny.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In the theory of forecasting, the logarithmic score and other so-called proper scoring rules are mathematically guaranteed to elicit the forecaster&#8217;s &#8220;true belief&#8221;. But the definition of a proper scoring rule assumes that the true belief is a single value, whereas it tends to be a wide range. In such a scenario, the forecaster is incentivized to report the mean of their distribution, but what we intuitively mean by the &#8220;true belief&#8221; might better correspond to the median.</p></div></div>]]></content:encoded></item><item><title><![CDATA[New paper: AI agents that matter]]></title><description><![CDATA[Rethinking AI agent benchmarking and evaluation]]></description><link>https://www.normaltech.ai/p/new-paper-ai-agents-that-matter</link><guid isPermaLink="false">https://www.normaltech.ai/p/new-paper-ai-agents-that-matter</guid><dc:creator><![CDATA[Sayash Kapoor]]></dc:creator><pubDate>Wed, 03 Jul 2024 16:00:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4de2d78f-afda-4151-b1ad-76986d007c29_1926x1494.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Some of the most exciting applications of large language models involve taking real-world action, such as booking flight tickets or finding and fixing software bugs. AI systems that carry out such tasks are called agents. They combine LLMs with other software to use tools such as web search and code terminals. </p><p>The North Star of this field is to build assistants like Siri or Alexa and get them to actually work &#8212; handle complex tasks, accurately interpret users&#8217; requests, and perform reliably. But this is far from a reality, and even the research direction is fairly new. To stimulate the development of agents and measure their effectiveness, researchers have created benchmark datasets. 
But as we&#8217;ve said before, <a href="https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/">LLM evaluation is a minefield</a>, and it turns out that agent evaluation has a bunch of additional pitfalls that affect today&#8217;s benchmarks and evaluation practices. This state of affairs encourages the development of agents that do well on benchmarks without being useful in practice.</p><p>We have released a new paper that identifies the challenges in evaluating agents and proposes ways to address them. <strong>Read the paper <a href="https://arxiv.org/abs/2407.01502">here</a>.</strong> The authors are Sayash Kapoor, <a href="https://citp.princeton.edu/citp-people/benedikt-strobl/">Benedikt Str&#246;bl</a>, <a href="https://www.zacharysiegel.org/">Zachary S. Siegel</a>, <a href="https://citp.princeton.edu/citp-people/nitya-nadgir/">Nitya Nadgir</a>, and Arvind Narayanan, all at Princeton University.&nbsp;</p><p>In this post, we offer thoughts on the definition of AI agents, explain why we are cautiously optimistic about the future of AI agent research, weigh whether AI agents are more hype or substance, and give a brief overview of the paper.</p><h4><strong>What does the term agent mean? Is it just a buzzword?</strong></h4><p>The term agent has been used by AI researchers without a formal definition.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> This has led to its being hijacked as a marketing term, and has generated a bit of pushback against its use. But the term isn&#8217;t meaningless. Many researchers have tried to formalize the community's intuitive understanding of what constitutes an agent in the context of language-model-based systems [<a href="https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf">1</a>, <a href="https://dl.acm.org/doi/10.1145/3593013.3594033">2</a>, <a href="https://arxiv.org/abs/2404.16244">3</a>, <a href="https://lilianweng.github.io/posts/2023-06-23-agent/">4</a>, <a href="https://blog.langchain.dev/what-is-an-agent/">5</a>]. Rather than a binary, agency can be seen as a spectrum, sometimes denoted by the term <a href="https://www.deeplearning.ai/the-batch/welcoming-diverse-approaches-keeps-machine-learning-strong/">'agentic'</a>.&nbsp;</p><p>The five recent definitions of AI agents cited above are all distinct but share strong similarities. Rather than propose a new definition, we identified three clusters of properties that cause an AI system to be considered more agentic according to existing definitions:</p><p><strong>Environment and goals.</strong> The more complex the environment, the more agentic the AI systems operating in that environment are. Complex environments are those that have a range of tasks and domains, multiple stakeholders, a long time horizon to take action, and unexpected changes. Further, systems that pursue complex goals without being instructed on how to pursue those goals are more agentic.</p><p><strong>User interface and supervision.</strong> AI systems that can be instructed in natural language and act autonomously on the user's behalf are more agentic. In particular, systems that require less user supervision are more agentic. 
For example, chatbots cannot take real-world action, but adding plugins to chatbots (such as Zapier for ChatGPT) allows them to take some actions on behalf of users.</p><p><strong>System design.</strong> Systems that use tools (like web search or a code terminal) or planning (like reflecting on previous outputs or decomposing goals into subgoals) are more agentic. Systems whose control flow is driven by an LLM, rather than LLMs being invoked by a static program, are more agentic.</p><h4><strong>Do agents even work?</strong></h4><p>While some agents such as ChatGPT&#8217;s code interpreter / data analysis mode have been useful, more ambitious agent-based products so far have failed. The two main product launches based on AI agents have been the <a href="https://www.rabbit.tech/">Rabbit R1</a> and <a href="https://humane.com/">Humane AI pin</a>. These devices promised to eliminate or reduce phone dependence, but turned out to be too slow and unreliable. Devin, an &#8220;AI software engineer&#8221;, was announced with great hype 4 months ago, but has been panned in a <a href="https://www.youtube.com/watch?v=tNmgmwEtoWE">video review</a> and remains in waitlist-only mode. It is clear that if AI agents are to be useful in real-world products, they have a long way to go.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!pOKK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F483f2069-4a32-41ad-8bb4-94b2fb7793fc_1310x664.png" alt=""><figcaption class="image-caption"><a href="https://x.com/SnazzyLabs/status/1783220942812111220">Source</a></figcaption></figure></div><p>So are AI agents all hype? It&#8217;s too early to tell. We think there are research challenges to be solved before we can expect agents such as the ones above to work well enough to be widely adopted. The only way to find out is through more research, so we do think research on AI agents is worthwhile.</p><p>One major research challenge is reliability &#8212; LLMs are <a href="https://x.com/random_walker/status/1778770599340290103">already capable enough</a> to do many tasks that people want an assistant to handle, but not reliable enough to be successful products. To appreciate why, think of a flight-booking agent that needs to make dozens of calls to LLMs. If each of those calls went wrong independently with a probability of, say, just 2%, the overall system would be so unreliable as to be completely useless (this partly explains some of the product failures we&#8217;ve seen). So research on improving reliability might have many new applications <em>even if</em> the underlying language models don&#8217;t improve. And if <a href="https://www.aisnakeoil.com/p/ai-scaling-myths">scaling runs out</a>, agents are the most natural direction for further progress in AI.</p>
<p>Right now, however, research is itself contributing to hype and overoptimism because evaluation practices are not rigorous enough, much like the early days of machine learning research before the common task method took hold. That brings us to our paper.</p><h4><strong>Contributions of the paper</strong></h4><p>What changes must the AI community implement to help stimulate the development of AI agents that are useful in the real world, and not just on benchmarks? This is the paper&#8217;s central question. We make five recommendations:</p><p>1. <strong>Implement cost-controlled evaluations.</strong> The language models underlying most AI agents are stochastic. This means simply calling the underlying model multiple times can increase accuracy. We show that such simple tricks can outperform complex agent architectures on the HumanEval benchmark, while costing much less. We argue that all agent evaluation must control for cost. (We originally published this finding <a href="https://www.aisnakeoil.com/p/ai-leaderboards-are-no-longer-useful">here</a>. In the two months since we published this post, Pareto curves and joint optimization of cost and accuracy have become <a href="https://x.com/lmsysorg/status/1807812671238258931">increasingly common</a> in <a href="https://www.together.ai/blog/together-moa">agent</a> <a href="https://x.com/corbtt/status/1803813970018791845">evaluations</a>.)</p><p>2. <strong>Jointly optimize accuracy and cost.</strong> Visualizing evaluation results as a Pareto curve of accuracy and inference cost opens up a new space of agent design: jointly optimizing the two metrics. We show how we can lower cost while maintaining accuracy on HotPotQA by implementing a modification to the DSPy framework.</p><p>3. <strong>Distinguish model and downstream benchmarking.</strong> Through a case study of NovelQA, we show how benchmarks meant for model evaluation can be misleading when used for downstream evaluation. We argue that downstream evaluation should account for dollar costs, rather than proxies for cost such as the number of model parameters.</p><p>4. <strong>Prevent shortcuts in agent benchmarks.</strong> We show that many types of overfitting to agent benchmarks are possible. We identify 4 levels of generality of agents and argue that different types of hold-out samples are needed based on the desired level of generality. Without proper hold-outs, agent developers can take shortcuts, even unintentionally. We illustrate this with a case study of the WebArena benchmark.</p><p>5. <strong>Improve the standardization and reproducibility of agent benchmarks.</strong> We found pervasive shortcomings in the reproducibility of WebArena and HumanEval evaluations. These errors inflate accuracy estimates and lead to overoptimism about agent capabilities.</p><h4><strong>Concluding thoughts: reasons for cautious optimism</strong></h4><p>AI agent benchmarking is new and best practices haven't yet been established, making it hard to distinguish genuine advances from hype. We think agents are sufficiently different from models that benchmarking practices need to be rethought. In our paper, we take the first steps toward a principled approach to agent benchmarking. 
We hope these steps will raise the rigor of AI agent evaluation and provide a firm foundation for progress.</p><p>A different strand of our research concerns the reproducibility crisis in ML-based research in <a href="https://www.aisnakeoil.com/p/scientists-should-use-ai-as-a-tool">scientific fields</a> such as medicine or social science. At some level, our current paper is similar. In ML-based science, our outlook is that things will get worse before they get better. But in AI agent research, we are cautiously optimistic that practices will change quickly. One reason is that there is a stronger culture of sharing code and data alongside published papers, so errors are easier to spot. (This culture shift came about due to concerted efforts in the <a href="https://jmlr.org/papers/v22/20-303.html">last five years</a>.) Another reason is that overoptimistic research quickly gets a reality check when products based on misleading evaluations end up flopping. This is going to be an interesting space to watch over the next few years, both in terms of research and product releases.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In traditional AI, agents are defined as entities that perceive and act upon their environment, but that definition is less useful in the LLM era &#8212; even a thermostat would qualify as an agent under that definition.</p></div></div>]]></content:encoded></item><item><title><![CDATA[AI scaling myths]]></title><description><![CDATA[Scaling will run out. The question is when.]]></description><link>https://www.normaltech.ai/p/ai-scaling-myths</link><guid isPermaLink="false">https://www.normaltech.ai/p/ai-scaling-myths</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Thu, 27 Jun 2024 18:16:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0ca1da67-ea35-4ded-aaf4-a0ff5d63e49b_1708x1310.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So far, bigger and bigger language models have proven more and more capable. But does the past predict the future?</p><p>One popular view is that we should expect the trends that have held so far to continue for many more orders of magnitude, and that this will potentially get us to artificial general intelligence, or AGI. </p><p>This view rests on a series of myths and misconceptions. The seeming predictability of scaling is a misunderstanding of what research has shown. Besides, there are signs that LLM developers are already at the limit of high-quality training data. And the industry is seeing strong <em>downward</em> pressure on model size. While we can't predict exactly how far AI will advance through scaling, we think there&#8217;s virtually no chance that scaling alone will lead to AGI.&nbsp;</p><h4><strong>Scaling &#8220;laws&#8221; are often misunderstood</strong></h4><p>Research on <a href="https://arxiv.org/abs/2001.08361">scaling laws</a> shows that as we increase model size, training compute, and dataset size, language models get &#8220;better&#8221;. The improvement is truly striking in its predictability, and holds across many orders of magnitude. This is the main reason why many people believe that scaling will continue for the foreseeable future, with regular releases of larger, more powerful models from leading AI companies.</p><p>But this is a complete misinterpretation of scaling laws. 
What exactly is a &#8220;better&#8221; model? Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence. Of course, perplexity is more or less irrelevant to end users &#8212; what matters is &#8220;<a href="https://arxiv.org/abs/2206.07682">emergent abilities</a>&#8221;, that is, models&#8217; tendency to acquire new capabilities as size increases.</p><p>Emergence is not governed by any law-like behavior. It is true that so far, increases in scale have brought new capabilities. But there is no empirical regularity that gives us confidence that this will continue indefinitely.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Why might emergence not continue indefinitely? This gets at one of the core debates about LLM capabilities &#8212; are they capable of extrapolation or do they only learn tasks represented in the training data? The evidence is incomplete and there is a wide range of reasonable ways to interpret it. But we lean toward the skeptical view. On benchmarks designed to test the efficiency of acquiring skills to solve unseen tasks, LLMs tend to perform <a href="https://arcprize.org/arc">poorly</a>.&nbsp;</p><p>If LLMs can't do much beyond what's seen in training, at some point, having more data no longer helps because all the tasks that are ever going to be represented in it are already represented. Every traditional machine learning model eventually plateaus; maybe LLMs are no different.</p><h4><strong>Trend extrapolation is baseless speculation</strong></h4><p>Another barrier to continued scaling is obtaining training data. Companies are already using all the readily available data sources. Can they get more?</p><p>This is less likely than it might seem. People sometimes assume that new data sources, such as transcribing all of YouTube, will increase the available data volume by another order of magnitude or two. Indeed, YouTube has a remarkable <a href="https://journalqd.org/article/view/4066">150 billion minutes</a> of video. But considering that most of that has little or no usable audio (it is instead music, still images, video game footage, etc.), we end up with an estimate that is much <em>less</em> than the 15 trillion tokens that Llama 3 is already using &#8212; and that&#8217;s before deduplication and quality filtering of the transcribed YouTube audio, which is likely to knock off at least another order of magnitude.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>People often discuss when companies will &#8220;run out&#8221; of training data. But this is not a meaningful question. There&#8217;s always more training data, but getting it will cost more and more. And now that copyright holders have <a href="https://reutersinstitute.politics.ox.ac.uk/how-many-news-websites-block-ai-crawlers">wised up</a> and want to be compensated, the cost might be especially steep. In addition to dollar costs, there could be reputational and regulatory costs because society might push back against data collection practices.</p>
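<p>For the curious, here is a rough sketch of the Fermi arithmetic behind the YouTube estimate above; every number is an assumption one could reasonably quibble with:</p><pre><code># A rough Fermi estimate of usable YouTube tokens (all assumptions ours).
minutes_of_video = 150e9      # total minutes on YouTube, per the study cited
usable_fraction = 0.1         # assume ~90% has little or no usable speech
words_per_minute = 150        # typical speaking rate
tokens_per_word = 1.3         # common tokenizer rule of thumb

tokens = minutes_of_video * usable_fraction * words_per_minute * tokens_per_word
print(f"{tokens:.2e}")  # ~2.9e+12 tokens -- already well under Llama 3's
                        # 15 trillion, before deduplication and filtering</code></pre>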
<p>We can be certain that no exponential trend can continue indefinitely. But it can be hard to predict when a tech trend is about to plateau. This is especially so when the growth stops suddenly rather than gradually. The trendline itself contains no clue that it is about to plateau.&nbsp;</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!M59s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe6b2381-e80e-42b4-b326-814ab6422be7_756x468.png" alt="CPU clock speeds over time"><figcaption class="image-caption">CPU clock speeds over time. The y-axis is logarithmic. [<a href="https://en.wikipedia.org/wiki/File:Clock_CPU_Scaling.jpg">Source</a>]</figcaption></figure></div><p>Two famous examples are CPU clock speeds in the 2000s and airplane speeds in the 1970s. CPU manufacturers decided that further increases to clock speed were too costly and mostly pointless (since the CPU was no longer the bottleneck for overall performance), and simply stopped competing on this dimension, which suddenly removed the upward pressure on clock speed. With airplanes, the story is more complex but comes down to the market prioritizing <a href="https://theicct.org/sites/default/files/publications/Aircraft-fuel-burn-trends-sept2020.pdf">fuel</a> <a href="https://www.etw.de/uploads/pdfs/ATAG_Beginners_Guide_to_Aviation_Efficiency_web.pdf">efficiency</a> over speed.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!tGBk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf7e082-b3c6-4ffb-82c6-85912217135e_2002x1316.png" alt="Flight airspeed records over time"><figcaption class="image-caption">Flight airspeed records over time. The SR-71 Blackbird record from 1976 still stands today. [<a href="https://en.wikipedia.org/wiki/Flight_airspeed_record">Source</a>]</figcaption></figure></div><p>With LLMs, we may have a couple of orders of magnitude of scaling left, or we may <em>already</em> be done. 
As with CPUs and airplanes, it is ultimately a business decision and fundamentally hard to predict in advance.</p><p>On the research front, the focus has shifted from compiling ever-larger datasets to improving the <a href="https://x.com/karpathy/status/1797313173449764933">quality</a> of training data. Careful data cleaning and filtering can allow building equally powerful models with <a href="https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/">much</a> <a href="https://arxiv.org/abs/2406.11794">smaller</a> datasets.<a class="footnote-anchor" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><h4><strong>Synthetic data is not magic</strong></h4><p>Synthetic data is often suggested as the path to continued scaling. In other words, maybe current models can be used to generate training data for the next generation of models.</p><p>But we think this rests on a misconception &#8212; we don't think developers are using (or can use) synthetic data to increase the <em>volume</em> of training data. <a href="https://arxiv.org/html/2404.07503v1">This paper</a> has a great list of uses of synthetic data for training, and it's all about fixing specific gaps and making domain-specific improvements in areas like math, code, or low-resource languages. Similarly, Nvidia's recent <a href="https://developer.nvidia.com/blog/leverage-our-latest-open-models-for-synthetic-data-generation-with-nvidia-nemotron-4-340b/">Nemotron 340B</a> model, which is geared toward synthetic data generation, targets alignment as the primary use case. There are a few secondary use cases, but replacing current sources of pre-training data is not one of them. In short, it's unlikely that mindless generation of synthetic training data will have the same effect as having more high-quality human data.</p><p>There are cases where synthetic training data has been spectacularly successful, such as <a href="https://www.nature.com/articles/nature24270">AlphaGo</a>, which beat the Go world champion in 2016, and its successors AlphaGo Zero and <a href="https://www.science.org/doi/10.1126/science.aar6404">AlphaZero</a>. These systems learned by playing games against themselves; the latter two did not use any human games as training data. They used a ton of calculation to generate somewhat high-quality games, then used those games to train a neural network, which could in turn generate even higher-quality games when combined with calculation, resulting in an iterative improvement loop.</p><p>Self-play is the quintessential example of &#8220;System 2 &#8594; System 1 distillation&#8221;, in which a slow and expensive &#8220;System 2&#8221; process generates training data to train a fast and cheap &#8220;System 1&#8221; model. This works well for a game like Go, which is a completely self-contained environment. Adapting self-play to domains beyond games is a valuable research direction. There are important domains, such as code generation, where this strategy may prove useful. But we certainly can&#8217;t expect indefinite self-improvement for more open-ended tasks, say language translation. We should expect domains that admit significant improvement through self-play to be the exception rather than the rule.</p>
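<p>To make the distillation pattern concrete, here is a toy sketch of our own (not AlphaZero&#8217;s actual algorithm) using the game of Nim: &#8220;System 2&#8221; is a slow exact search, and &#8220;System 1&#8221; is a fast lookup policy distilled from the search&#8217;s decisions.</p><pre><code># A toy sketch of "System 2 -> System 1 distillation" on the game of Nim
# (take 1-3 stones per turn; whoever takes the last stone wins). All names
# are illustrative; this is not AlphaZero's actual algorithm.
import functools, random

MOVES = (1, 2, 3)

@functools.lru_cache(maxsize=None)
def wins(n):
    """System 2: slow exact search. n is a win if some move leaves a loss."""
    return any(k <= n and not wins(n - k) for k in MOVES)

def system2_move(n):
    """The expensive, search-backed player."""
    legal = [k for k in MOVES if k <= n]
    winning = [k for k in legal if not wins(n - k)]
    return random.choice(winning or legal)

def distill(max_n=50):
    """System 1: a fast lookup policy trained on System 2's choices."""
    return {n: system2_move(n) for n in range(1, max_n + 1)}

policy = distill()
print(policy[13])  # 1: leaves 12 stones, a losing position (multiple of 4)
</code></pre><p>In AlphaZero, the analogous loop repeats: the distilled network makes the search stronger, and the stronger search generates better training data for the next network.</p>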
<h4><strong>Models have been getting smaller but are being trained for longer</strong></h4><p>Historically, the three axes of scaling &#8212; dataset size, model size, and training compute &#8212; have progressed in <a href="https://www.cnas.org/publications/reports/future-proofing-frontier-ai-regulation">tandem</a>, and this is known to be optimal. But what will happen if one of the axes (high-quality data) becomes a bottleneck? Will the other two axes, model size and training compute, continue to scale?</p><p>Based on current market trends, building bigger models does not seem like a wise business move, even if it would unlock new emergent capabilities. That&#8217;s because capability is no longer the barrier to adoption. In other words, there are many applications that are possible to build with <em>current</em> LLM capabilities but aren&#8217;t being built or adopted due to cost, among other reasons. This is especially true for &#8220;agentic&#8221; workflows which might invoke LLMs <a href="https://www.aisnakeoil.com/p/ai-leaderboards-are-no-longer-useful">tens or hundreds of times</a> to complete a task, such as <a href="https://www.youtube.com/watch?v=tNmgmwEtoWE">code generation</a>.</p><p>In the past year, much of the development effort has gone into producing <em>smaller</em> models at a given capability level.<a class="footnote-anchor" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> Frontier model developers no longer reveal model sizes, so we can&#8217;t be sure of this, but we can make educated guesses by using API pricing as a rough proxy for size. GPT-4o costs only 25% as much as GPT-4 does, while being similar or better in capabilities. We see the same pattern with Anthropic and Google. Claude 3 Opus is the most expensive (and presumably biggest) model in the Claude family, but the more recent Claude 3.5 Sonnet is both 5x cheaper and more capable. Similarly, Gemini 1.5 Pro is both cheaper and more capable than Gemini 1.0 Ultra. So with all three developers, the biggest model isn&#8217;t the most capable!</p><p>Training compute, on the other hand, will probably continue to scale for the time being. Paradoxically, smaller models require <em><a href="https://arxiv.org/abs/2203.15556">more</a></em><a href="https://arxiv.org/abs/2203.15556"> training</a> to reach the same level of performance. So the downward pressure on model size is putting upward pressure on training compute. In effect, developers are trading off training cost and inference cost.</p>
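<p>To see the shape of this tradeoff, here is a back-of-the-envelope sketch using the standard approximations of roughly 6&#183;N&#183;D FLOPs for training and 2&#183;N FLOPs per generated token at inference (N = parameters, D = training tokens). The two model configurations are loosely based on public figures; the lifetime serving volume is purely our assumption.</p><pre><code># Back-of-the-envelope training-vs-inference compute tradeoff.
# Approximations: training ~ 6*N*D FLOPs; inference ~ 2*N FLOPs per token.
# LIFETIME_TOKENS is an assumption for illustration, not a published figure.

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def inference_flops(n_params, tokens_served):
    return 2 * n_params * tokens_served

LIFETIME_TOKENS = 1e13  # assumed tokens served over the model's lifetime

for name, n_params, n_tokens in [
    ("70B params, ~compute-optimal 1.4T tokens", 70e9, 1.4e12),
    ("8B params, over-trained on 15T tokens", 8e9, 15e12),
]:
    train = training_flops(n_params, n_tokens)
    serve = inference_flops(n_params, LIFETIME_TOKENS)
    print(f"{name}: train {train:.1e} + serve {serve:.1e} = {train + serve:.1e} FLOPs")
</code></pre><p>Under these assumptions, the smaller model costs more to train but far less to serve, so its lifetime compute is lower; this is exactly the tradeoff pushing developers toward smaller, longer-trained models.</p>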
<p>The earlier crop of models such as GPT-3.5 and GPT-4 was under-trained in the sense that inference costs over the model's lifetime are thought to dominate training cost. Ideally, the two should be roughly equal, given that it is always possible to <a href="https://epochai.org/blog/trading-off-compute-in-training-and-inference">trade off</a> training cost for inference cost and vice versa. In a notable example of this trend, Llama 3 used <em>20 times</em> as many training FLOPs for the 8 billion parameter model as the original Llama model did at roughly the same size (7 billion).</p><h4><strong>The ladder of generality</strong></h4><p>One sign consistent with the possibility that we won&#8217;t see much more capability improvement through scaling is that CEOs have been greatly <a href="https://www.cnbc.com/2024/01/16/openais-sam-altman-agi-coming-but-is-less-impactful-than-we-think.html">tamping down</a> AGI expectations. Unfortunately, instead of admitting they were wrong about their naive &#8220;AGI in 3 years&#8221; predictions, they've decided to save face by watering down what they mean by AGI so much that it's meaningless now. It helped that AGI was <a href="https://www.scientificamerican.com/article/what-does-artificial-general-intelligence-actually-mean/">never clearly defined</a> to begin with.</p><p>Instead of viewing generality as a binary, we can view it as a spectrum. Historically, the amount of effort it takes to get a computer to perform a new task has decreased. We can view this as increasing generality. This trend began with the move from special-purpose computers to Turing machines. In this sense, the general-purpose nature of LLMs is not new.</p><p>This is the view we take in the <a href="https://www.amazon.com/Snake-Oil-Artificial-Intelligence-Difference/dp/069124913X">AI Snake Oil book</a>, which has a chapter dedicated to AGI. We conceptualize the history of AI as a punctuated equilibrium, which we call the ladder of generality (which isn&#8217;t meant to imply linear progress). Instruction-tuned LLMs are the latest step on the ladder. An unknown number of steps lie ahead before we reach a level of generality where AI can perform any economically valuable job as effectively as any human (which is one definition of AGI).</p><p>Historically, standing on each step of the ladder, the AI research community has been terrible at predicting how much farther you can go with the current paradigm, what the next step will be, when it will arrive, what new applications it will enable, and what the implications for safety are. <em>That</em> is a trend we think will continue.</p><h4><strong>Further reading</strong></h4><p>A recent <a href="https://situational-awareness.ai/">essay</a> by Leopold Aschenbrenner made waves due to its claim that &#8220;AGI by 2027 is strikingly plausible&#8221;. We haven&#8217;t tried to give a point-by-point rebuttal here &#8212; most of this post was drafted before Aschenbrenner&#8217;s essay was released. His arguments for his timeline are entertaining and thought-provoking, but fundamentally an exercise in trendline extrapolation.
Also, like many AI boosters, he <a href="https://www.aisnakeoil.com/p/gpt-4-and-professional-benchmarks">conflates</a> benchmark performance with real-world usefulness.</p><p>Many AI researchers have made the skeptical case, including <a href="https://www.science.org/doi/10.1126/science.ado7069">Melanie Mitchell</a>, <a href="https://x.com/ylecun/status/1796982509567180927">Yann LeCun</a>, <a href="https://garymarcus.substack.com/p/breaking-news-scaling-will-never">Gary Marcus</a>, <a href="https://www.youtube.com/watch?v=UakqL6Pj9xo">Francois Chollet</a>, and <a href="https://arxiv.org/pdf/2402.01817">Subbarao Kambhampati and others</a>.</p><p>Dwarkesh Patel gives a nice <a href="https://www.dwarkeshpatel.com/p/will-scaling-work">overview</a> of both sides of the debate.</p><p><strong>Acknowledgements. </strong>We are grateful to Matt Salganik, Ollie Stephenson, and Benedikt Str&#246;bl for feedback on a draft.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Emergent abilities will be predictable if we can find a metric that changes <a href="https://arxiv.org/abs/2304.15004">smoothly</a> instead of discontinuously, but finding such a metric <a href="https://cset.georgetown.edu/article/emergent-abilities-in-large-language-models-an-explainer/">isn&#8217;t easy</a>, especially for tasks that require a <a href="https://windowsontheory.org/2023/12/22/emergent-abilities-and-grokking-fundamental-mirage-or-both/">combination</a> of skills. In practice, the question of whether and which new abilities will emerge at the next order of magnitude remains anyone&#8217;s guess.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>AI companies do use <a href="https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html">transcribed YouTube data</a> for training, but the reason it is valuable is that it helps LLMs learn what spoken conversations look like, not because of its volume.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Libertarian commentators predictably attribute the stagnation of airplane speeds entirely to <a href="https://www.mercatus.org/research/data-visualizations/airplane-speeds-have-stagnated-40-years">regulation</a>, but this is wrong or, at best, highly oversimplified. It&#8217;s true that the FAA essentially banned supersonic flight by civil aircraft over land in the U.S. in 1973. But the fastest aircraft are all military, so the ban doesn&#8217;t affect them. And civil aircraft cruise well below Mach 1 due to fuel efficiency and other considerations.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>There is a debate about whether LLM training can be made orders of magnitude more sample efficient. After all, children acquire language after being exposed to far fewer words than LLMs are. 
On the other hand, children are &#8220;<a href="https://www.amazon.com/Scientist-Crib-Early-Learning-Tells/dp/0688177883">scientists in the crib</a>&#8221;, developing world models and reasoning abilities early on, which might be what enables efficient language acquisition. This debate is orthogonal to our point. If task representation or difficulty of extrapolation is the bottleneck, it will impose an upper limit on LLM capabilities regardless of sample efficiency.</p></div></div><div class="footnote"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" target="_self">5</a><div class="footnote-content"><p>Even when model developers have released larger models (in terms of parameter count), there is an increased focus on inference efficiency, such as in mixture-of-experts models like <a href="https://mistral.ai/news/mixtral-8x22b/">Mixtral 8x22B</a>, where the number of active parameters during inference is much lower than the total parameter count.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Scientists should use AI as a tool, not an oracle]]></title><description><![CDATA[How AI hype leads to flawed research that fuels more hype]]></description><link>https://www.normaltech.ai/p/scientists-should-use-ai-as-a-tool</link><guid isPermaLink="false">https://www.normaltech.ai/p/scientists-should-use-ai-as-a-tool</guid><dc:creator><![CDATA[Arvind Narayanan]]></dc:creator><pubDate>Mon, 03 Jun 2024 18:34:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8Au-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877ad57f-f2fc-4b70-9f03-a4bd52e490a7_1296x822.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Who produces AI hype? As we discuss in the AI Snake Oil <a href="https://www.aisnakeoil.com/p/ai-snake-oil-is-now-available-to">book</a>, it is not just companies and the media but also AI researchers. For example, a pair of widely-publicized papers in Nature in December 2023 claimed to have <a href="https://www.nature.com/articles/s41586-023-06735-9">discovered</a> over 2.2 million new materials using AI, and robotically <a href="https://www.nature.com/articles/s41586-023-06734-w">synthesized</a> 41 of them. Unfortunately, the claims were <a href="https://chemrxiv.org/engage/chemrxiv/article-details/65957d349138d231611ad8f7">quickly</a> <a href="https://x.com/Robert_Palgrave/status/1744383965270581615">debunked</a>: &#8220;Most of the [41] materials produced were misidentified, and the rest were already known&#8221;. As for the large dataset, examining a sample of 250 compounds showed that it was <a href="https://pubs.acs.org/doi/10.1021/acs.chemmater.4c00643">mostly junk</a>.</p><p>A core selling point of machine learning is discovery without understanding, which is why errors are particularly common in machine-learning-based science. Three years ago, we <a href="https://reproducible.cs.princeton.edu/">compiled evidence</a> revealing that an error called leakage &#8212; the machine learning version of teaching to the test &#8212; was pervasive, affecting hundreds of papers from 17 disciplines. Since then, we have been trying to understand the problem better and devise solutions.</p>
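<p>To make the error concrete, here is a minimal, purely illustrative example of one common form of leakage: fitting a preprocessing step (here, feature selection) on the full dataset before splitting it into training and test sets. The data is pure noise, so any &#8220;skill&#8221; the leaky pipeline shows is an artifact.</p><pre><code># A minimal, illustrative leakage demo: feature selection before vs. after
# the train/test split. The data is pure noise, so honest accuracy is ~50%.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))    # 10,000 noise features
y = rng.integers(0, 2, size=100)      # random binary labels

# LEAKY: select the 20 "most predictive" features using *all* the data
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
Xtr, Xte, ytr, yte = train_test_split(X_sel, y, random_state=0)
print("leaky accuracy:", LogisticRegression().fit(Xtr, ytr).score(Xte, yte))

# CORRECT: split first, then select features on the training set only
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
sel = SelectKBest(f_classif, k=20).fit(Xtr, ytr)
clf = LogisticRegression().fit(sel.transform(Xtr), ytr)
print("honest accuracy:", clf.score(sel.transform(Xte), yte))
</code></pre><p>The leaky version typically reports accuracy far above chance on data that contains no signal at all; published cases are usually subtler versions of the same mistake.</p>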
<p>This post presents an update. In short, we think things will get worse before they get better, although there are glimmers of hope on the horizon.</p><p><strong>The carnage continues</strong></p><p>In our most recent compilation, the number of disciplines where researchers have <a href="https://reproducible.cs.princeton.edu/#rep-failures">uncovered leakage</a> in published work has reached 30. The majority are medical fields, which we strongly suspect is because errors in medical research can be particularly consequential, so medical fields put much more effort into establishing best practices and critically reviewing previously published work. About 650 papers across all fields are affected, which we hypothesize is a vast underestimate &#8212; when researchers look for leakage systematically, in many fields they find that the <em>majority</em> of sampled studies commit the error of leakage.</p><p>Leakage is one of many reasons for reproducibility failures. There are widespread <a href="https://reforms.cs.princeton.edu/appendix3.html">shortcomings</a> in every step of ML-based science, from data collection to preprocessing and reporting results. Problems that might lead to irreproducibility include improper comparisons to baselines, unrepresentative samples, results being sensitive to specific modeling choices, and not reporting model uncertainties. There is also the basic problem of researchers failing to publish their code and data, precluding reproducibility. For example, Gabelica et al. <a href="https://pubmed.ncbi.nlm.nih.gov/35654271/">examined</a> 333 open-access journals indexed on BioMed Central in January 2019 and found that out of the 1,800 papers that pledged to share data upon request, 93% did not do so.</p><p><strong>The roots run deep</strong></p><p>Even before ML, many scientific fields were already facing reproducibility and replicability crises. The root causes include the publish-or-perish culture in science, the strong bias for publishing positive results (and the near-impossibility of publishing negative results), the lack of incentives for debunking faulty studies, and the lack of consequences for publishing shoddy work. For example, faulty papers are almost never retracted. Peers don&#8217;t even seem to notice replication failures &#8212; after a paper fails to replicate, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10460510/">only 3%</a> of citing articles cited the replication attempt.<a class="footnote-anchor" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> Science communicators love to claim that science self-corrects, but self-correction is practically nonexistent in our experience.</p><p>All of these cultural factors are also present in ML-based science. But ML introduces a bunch of additional reasons why we should be skeptical of published results. Performance evaluation is notoriously tricky, and many aspects of it, such as uncertainty quantification, are unresolved research areas. Also, ML code tends to be vastly more complex and less standardized than traditional statistical modeling. Since it is not peer reviewers&#8217; job to review code, coding errors are rarely discovered.</p><p>But we think the biggest reason for the poor quality of research is pervasive hype, resulting in the lack of a skeptical mindset among researchers, even though such a mindset is a cornerstone of good scientific practice.
We&#8217;ve observed that when researchers have overoptimistic expectations and their ML model performs poorly, they assume that they did something wrong and tweak the model, when in fact they should strongly consider the possibility that they have run up against inherent <a href="https://msalganik.github.io/cos597E-soc555_f2020/">limits to predictability</a>. Conversely, they tend to be credulous when their model performs well, when they should instead be on high alert for leakage or other flaws. And if the model performs better than expected, they assume that it has discovered patterns in the data that no human could have thought of, and the myth of AI as an alien intelligence makes this explanation seem readily plausible.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!8Au-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877ad57f-f2fc-4b70-9f03-a4bd52e490a7_1296x822.png" alt=""></figure><p>This is a feedback loop. Overoptimism fuels flawed research, which further misleads other researchers in the field about what they should and shouldn&#8217;t expect AI to be able to do. In fact, we&#8217;ve encountered extreme versions of this in private correspondence with frustrated researchers: since flawed research goes uncorrected, it becomes literally impossible to publish good research, since it will result in models that don&#8217;t beat the &#8220;state of the art&#8221;.</p><p>The more powerful and more black-box the tool, the greater the potential for errors and overconfidence. The replication crises in psychology, medicine, etc. were the result of misapplication of plain old statistics. Given how relatively new ML is, our guess is that the reproducibility crisis in ML-based science will get worse for a while before it starts to get better. And now scientists are embracing large language models and generative AI, which open up many new pitfalls such as the <a href="https://www.nature.com/articles/s41586-024-07146-0">illusion of understanding</a>.</p><p><strong>Glimmers of hope</strong></p><p>One good thing about ML-based science is that it usually involves only data analysis, not experimenting on people. So other researchers should in principle be able to download a paper&#8217;s code and data and check whether they can reproduce the reported results. They can also review the code for any errors or problematic choices. This is time-consuming, but much less so than replicating a study in psychology or medicine, which is typically almost as costly as the original study.</p><p>Another good thing is that the vast majority of errors can be avoided if the researchers know what to look out for. In contrast, mitigations for the replication crisis in statistical science, such as pre-registration, have a much spottier track record of effectiveness.</p><p>So we think that the problem can be greatly mitigated by a culture change where researchers systematically exercise more care in their work and reproducibility studies are incentivized.
The ML <em>methods</em> community has already moved in this direction via the <a href="https://www.simonsfoundation.org/event/reproducible-research-and-the-common-task-method/">common task method</a> (which is decades old) and the <a href="https://arxiv.org/abs/2003.12206">reproducibility challenge</a> (which is more recent), but this has not yet happened in ML-based science, that is, in disciplines like medicine or psychology that use ML models to advance knowledge in their respective fields.</p><p>We have led a few efforts to change this. First, our leakage paper has had an impact. It has been used by researchers to clarify how they build models, and to document and demonstrate the <a href="https://sportrxiv.org/index.php/server/preprint/view/191/351">absence of leakage</a>. It has been used by researchers trying to find leakage in <a href="https://arxiv.org/pdf/2401.14497">published work</a>. It has also been used to underscore the importance of studying leakage and coming up with <a href="https://www.nature.com/articles/s41559-023-02162-1">discipline-specific</a> <a href="https://www.sciencedirect.com/science/article/pii/S0928098723001926">guidelines</a>.</p><p>Beyond leakage, we led a group of 19 researchers across computer science, data science, social sciences, mathematics, and biomedical research to develop the <a href="https://reforms.cs.princeton.edu/">REFORMS</a> checklist for ML-based science. It is a 32-item checklist that can help researchers catch eight kinds of common pitfalls in ML-based science, of which leakage is only one. It was recently <a href="https://www.science.org/doi/epdf/10.1126/sciadv.adk3452">published</a> in Science Advances. Of course, checklists by themselves won&#8217;t help if there isn&#8217;t a culture change, but based on the reception so far, we are cautiously optimistic.</p><p><strong>Concluding thoughts</strong></p><p>Our point isn&#8217;t that AI is useless to scientists. We ourselves frequently use AI as a tool, even in our research that&#8217;s not about AI. The key word is tool. AI is not a revolution. It is not a replacement for human understanding &#8212; to think so is to miss the point of science. AI does not offer a shortcut to the hard work and frustration inherent to research. AI is not an oracle and cannot see the future.</p><p>Unfortunately, most scientific fields have succumbed to AI hype, leading to a suspension of common sense. For example, a line of research in political science claimed to predict the onset of civil war with an accuracy<a class="footnote-anchor" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> of well over 90%, a number that should sound facially impossible. (It <a href="https://reproducible.cs.princeton.edu/#civil-war">turned out</a> to be leakage, which is what got us interested in this whole line of research.)</p>
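<p>A tiny simulation makes footnote 2&#8217;s point concrete: with a rare outcome, a classifier that always predicts the majority class can post high accuracy, but uninformative scores cannot beat an AUC of roughly 50%. (Illustrative code, not drawn from the papers in question.)</p><pre><code># Why >90% AUC should raise eyebrows: on imbalanced data, a useless
# classifier gets high *accuracy*, but its AUC stays near the 50% baseline.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.05).astype(int)  # 5% "war", 95% "peace"
scores = rng.random(10_000)                  # scores carry no information

print("accuracy, always predict peace:", accuracy_score(y, np.zeros_like(y)))
print("AUC of uninformative scores:   ", round(roc_auc_score(y, scores), 3))
</code></pre>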
<p>We are at an interesting moment in the history of science. Look at these graphs showing the adoption of AI in various fields:<a class="footnote-anchor" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><figure><img src="https://substackcdn.com/image/fetch/$s_!Ik9I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4527ec18-fa1f-49aa-95cd-f4395fed3dd6_1560x1234.png" alt=""><figcaption>Percentage of AI-engaged papers by field, 1985&#8211;2023. (<a href="https://arxiv.org/abs/2405.15828">Source: Duede et al. 2024</a>)</figcaption></figure><p>These hockey stick graphs are not good news. They should be terrifying. Adopting AI requires changes to scientific epistemology.<a class="footnote-anchor" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> No scientific field has the capacity to accomplish this on a timescale of a couple of years. This is not what happens when a tool or method is adopted organically. It happens when scientists jump on a trend to get funding. Given the level of hype, scientists don&#8217;t need <em>additional</em> incentives to adopt AI. That means AI-for-science funding programs are probably making things worse. We doubt the avalanche of flawed research can be stopped, but if at least a fraction of AI-for-science funding were diverted to better training, critical inquiry, meta-science, reproducibility, and other quality-control efforts, the havoc could be minimized.</p><p>Our book <em>AI Snake Oil</em> is now available to preorder. If you have enjoyed our blog and would like to support our work, please preorder via <a href="https://substack.com/redirect/f0945e82-44e2-4998-9535-cee6fe9f0fa5?j=eyJ1IjoiYmd4a3MifQ.EjRMsvQe8Xc2mF1xAwL5aBabUU37X2wfP2-gBTgHzJM">Amazon</a>, <a href="https://substack.com/redirect/fd84f1c5-276b-488f-b91b-3cc7f32821ba?j=eyJ1IjoiYmd4a3MifQ.EjRMsvQe8Xc2mF1xAwL5aBabUU37X2wfP2-gBTgHzJM">Bookshop</a>, or your favorite bookseller.</p><div class="footnote"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" target="_self">1</a><div class="footnote-content"><p>To be clear, replication failures don&#8217;t necessarily imply flaws in the original study.
Our concern in this post is primarily about relatively clear-cut errors such as leakage.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Accuracy here refers to a metric called AUC; the baseline AUC is 50% even when one outcome (peace) is much more common than the other (war).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The paper clubs together different types of AI &#8220;engagement&#8221;: <em>Engagement could include (but is not limited to) the development of novel AI theory and approaches, technologies, or applications; the general use of AI models for domain-specific tasks; and critical engagement with AI, as typified by academic discourse in fields like philosophy and ethics.</em> This is unfortunate for our purposes, as our concern is solely about the second category, the use of AI for domain-specific tasks. We do think that outside of a few fields like computer science and philosophy, most AI engagement falls into this category.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In particular, as the saying goes, &#8220;all models are wrong but some models are useful&#8221;. There is no straightforward answer to the question of when we can draw conclusions about the world based on a model, so validity has to be re-litigated in every field and for every type of model.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[AI leaderboards are no longer useful. It's time to switch to Pareto curves.]]></title><description><![CDATA[What spending $2,000 can tell us about evaluating AI agents]]></description><link>https://www.normaltech.ai/p/ai-leaderboards-are-no-longer-useful</link><guid isPermaLink="false">https://www.normaltech.ai/p/ai-leaderboards-are-no-longer-useful</guid><dc:creator><![CDATA[Sayash Kapoor]]></dc:creator><pubDate>Tue, 30 Apr 2024 14:03:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/eba33d2f-7779-4659-97e0-c55e5608d580_3404x1902.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>By Sayash Kapoor, <a href="https://citp.princeton.edu/citp-people/benedikt-strobl/">Benedikt Stroebl</a>, Arvind Narayanan</em></p><p>Which is the most accurate AI system for generating code? Surprisingly, there isn&#8217;t currently a good way to answer questions like these.&nbsp; </p><p>Based on <a href="https://paperswithcode.com/sota/code-generation-on-humaneval">HumanEval</a>, a widely used benchmark for code generation, the most accurate publicly available system is <a href="https://arxiv.org/abs/2402.16906">LDB</a> (short for LLM debugger).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> But there&#8217;s a catch. The most accurate generative AI systems, including LDB, tend to be agents,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> which repeatedly invoke language models like GPT-4. 
That means they can be orders of magnitude more costly to run than the models themselves (which are already pretty costly). If we eke out a 2% accuracy improvement for 100x the cost, is that really better?</p><p>In this post, we argue that:</p><ul><li><p>AI agent accuracy measurements that don&#8217;t control for cost aren&#8217;t useful.</p></li><li><p>Pareto curves can help visualize the accuracy-cost tradeoff.</p></li><li><p>Current state-of-the-art agent architectures are complex and costly but no more accurate than extremely simple baseline agents that cost 50x less in some cases.</p></li><li><p>Proxies for cost such as parameter count are misleading if the goal is to identify the best system for a given task. We should directly measure dollar costs instead.</p></li><li><p>Published agent evaluations are difficult to reproduce because of a lack of standardization and questionable, undocumented evaluation methods in some cases.</p></li></ul><h4><strong>Maximizing accuracy can lead to unbounded cost</strong></h4><p>LLMs are stochastic. Simply calling a model <a href="https://arxiv.org/abs/2402.05120">many</a> <a href="https://arxiv.org/pdf/2403.02419.pdf">times</a> and outputting the most common answer can increase accuracy.</p>
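<p>As a sketch of what this looks like, assuming a hypothetical <code>complete(prompt, temperature)</code> helper that returns one sampled answer per call:</p><pre><code># A minimal sketch of majority voting over repeated samples. `complete` is
# a hypothetical one-call LLM helper, not a real library function.
from collections import Counter

def majority_vote(prompt, n_samples=25, temperature=0.7):
    answers = [complete(prompt, temperature) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # most frequent answer wins
</code></pre>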
<p>On some tasks, there is seemingly no limit to the amount of inference compute that can improve accuracy.<a class="footnote-anchor" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> Google DeepMind's <a href="https://arxiv.org/pdf/2203.07814.pdf">AlphaCode</a>, which improved accuracy on automated coding evaluations, showed that this trend holds even when calling LLMs millions of times.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!2od5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22df54d-213b-44da-aacd-a55fb24f9089_688x544.jpeg" alt=""><figcaption><em>The accuracy of <a href="https://arxiv.org/pdf/2203.07814.pdf">AlphaCode</a> on coding tasks continues to improve even after making a million calls to the underlying model (the different curves represent varying parameter counts). Accuracy is measured by how often one of the top 10 answers generated by the model is correct.</em></figcaption></figure><p>A useful evaluation of agents must therefore ask: What did it cost?
If we don&#8217;t do cost-controlled comparisons, it will encourage researchers to develop extremely costly agents just to claim they topped the leaderboard.</p><p>In fact, when we evaluate agents that have been proposed in the last year for solving coding tasks, we find that visualizing the tradeoff between cost and accuracy yields surprising insights.</p><h4><strong>Visualizing the accuracy-cost tradeoff on HumanEval, with new baselines</strong></h4><p>We re-evaluated the accuracy of three agents that have been claimed to occupy top spots on the HumanEval leaderboard: <a href="https://arxiv.org/pdf/2402.16906v3.pdf">LDB</a>, <a href="https://arxiv.org/pdf/2310.04406v2.pdf">LATS</a>, and <a href="https://arxiv.org/pdf/2303.11366.pdf">Reflexion</a>.<a class="footnote-anchor" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> We also evaluated the cost and time requirements of running these agents.</p><p>These agents rely on running the code generated by the model, and if it fails the test cases provided with the problem description, they try to debug the code, look at alternative paths in the code generation process, or "reflect" on why the model's outputs were incorrect before generating another solution.</p><p>In addition, we calculated the accuracy, cost, and running time of a few simple baselines, sketched in code after this list:</p><ul><li><p><strong>GPT-3.5</strong> and <strong>GPT-4</strong> models (zero shot; no agent architecture<a class="footnote-anchor" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>)</p></li><li><p><strong>Retry:</strong> We repeatedly invoke a model with the temperature set to zero, up to five times, if it fails the test cases provided with the problem description.<a class="footnote-anchor" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> Retrying makes sense because LLMs <a href="https://community.openai.com/t/run-same-query-many-times-different-results/140588#:~:text=Apr%202023-,OpenAI%20models%20are%20non%2Ddeterministic%2C%20meaning%20that%20identical%20inputs%20can%20yield,amount%20of%20variability%20may%20remain%20due%20to%20GPU%20floating%20point%20math.,-Solution">aren&#8217;t deterministic</a> even at temperature zero.</p></li><li><p><strong>Warming:</strong> This is the same as the retry strategy, but we gradually increase the temperature of the underlying model with each run, from 0 to 0.5. This increases the stochasticity of the model and, we hope, the likelihood that at least one of the retries will succeed.</p></li><li><p><strong>Escalation:</strong> We start with a cheap model (Llama-3 8B) and escalate to more expensive models (GPT-3.5, Llama-3 70B, GPT-4) if we encounter a test case failure.<a class="footnote-anchor" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p></li></ul>
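<p>Here is a minimal sketch of the three retry-style baselines. The <code>complete</code> and <code>passes_tests</code> helpers are hypothetical stand-ins (one LLM call, and a check against the test cases provided with the problem), not our actual evaluation harness:</p><pre><code># Minimal sketches of the retry, warming, and escalation baselines.
# `complete(model, prompt, temperature)` and `passes_tests(code)` are
# hypothetical helpers, not a real API.

def retry(prompt, model="gpt-4", attempts=5):
    for _ in range(attempts):
        code = complete(model, prompt, temperature=0.0)
        if passes_tests(code):
            return code
    return code  # give up and return the last attempt

def warming(prompt, model="gpt-4", attempts=5):
    for i in range(attempts):
        # temperature rises from 0 to 0.5 across retries to vary the samples
        code = complete(model, prompt, temperature=0.5 * i / max(attempts - 1, 1))
        if passes_tests(code):
            return code
    return code

def escalation(prompt):
    # try cheap models first; escalate to pricier ones on test-case failure
    for model in ("llama-3-8b", "gpt-3.5-turbo", "llama-3-70b", "gpt-4"):
        code = complete(model, prompt, temperature=0.0)
        if passes_tests(code):
            return code
    return code
</code></pre>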
This increases the stochasticity of the model and, we hope, increases the likelihood that at least one of the retries will succeed.</p></li><li><p><strong>Escalation:</strong> We start with a cheap model (Llama-3 8B) and escalate to more expensive models (GPT-3.5, Llama-3 70B, GPT-4) if we encounter a test case failure.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p></li></ul><p>Surprisingly, we are not aware of any papers that compare their proposed agent architectures with any of the latter three simple baselines.</p><p>Our most striking result is that <strong>agent architectures for HumanEval do not outperform our simpler baselines despite costing more</strong>. In fact, agents differ drastically in terms of cost: for substantially similar accuracy, the cost can differ by almost two orders of magnitude!<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> Yet, the cost of running these agents isn't a top-line metric reported in any of these papers.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_AjE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb16a9f-29bd-47a4-813e-929685ce3fd8_1029x705.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_AjE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb16a9f-29bd-47a4-813e-929685ce3fd8_1029x705.png 424w, https://substackcdn.com/image/fetch/$s_!_AjE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb16a9f-29bd-47a4-813e-929685ce3fd8_1029x705.png 848w, https://substackcdn.com/image/fetch/$s_!_AjE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb16a9f-29bd-47a4-813e-929685ce3fd8_1029x705.png 1272w, https://substackcdn.com/image/fetch/$s_!_AjE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb16a9f-29bd-47a4-813e-929685ce3fd8_1029x705.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_AjE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb16a9f-29bd-47a4-813e-929685ce3fd8_1029x705.png" width="628" height="430.26239067055394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dcb16a9f-29bd-47a4-813e-929685ce3fd8_1029x705.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:705,&quot;width&quot;:1029,&quot;resizeWidth&quot;:628,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
<p>Our most striking result is that <strong>agent architectures for HumanEval do not outperform our simpler baselines despite costing more</strong>. In fact, agents differ drastically in terms of cost: for substantially similar accuracy, the cost can differ by almost two orders of magnitude!<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> Yet, the cost of running these agents isn't a top-line metric reported in any of these papers.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_AjE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb16a9f-29bd-47a4-813e-929685ce3fd8_1029x705.png" alt="Accuracy vs. total cost on HumanEval for agents and simple baselines"><figcaption class="image-caption"><em>Our simple baselines offer Pareto improvements over existing agent architectures. We run each agent five times and report the mean accuracy and the mean total cost on the 164 HumanEval problems. Where results for LDB have two models/agents in parentheses, they indicate the language model or agent used to generate the code, followed by the language model used to debug the code. Where they have just one, they indicate that the same model was used to both generate the code and debug it. Note that the y-axis is shown from 0.7 to 1; figures with the full axis (0 to 1) and error bars, robustness checks, and other details about our empirical results are included in the <a href="https://www.cs.princeton.edu/~sayashk/papers/ai-leaderboards-appendix.pdf">appendix</a>.</em></figcaption></figure></div>
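<p>The Pareto framing in the figure above is easy to operationalize. Here is a minimal sketch of identifying Pareto-optimal agents from measured (cost, accuracy) pairs; the numbers in the example are illustrative placeholders, not our measured results.</p><pre><code># Sketch: find the agents on the cost-accuracy Pareto frontier.
# An agent is dominated if another agent is at least as accurate and at
# least as cheap, and strictly better on at least one of the two axes.

def pareto_frontier(results):
    """results: dict mapping agent name to (mean_cost, mean_accuracy)."""
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            other_acc >= acc and cost >= other_cost
            and (other_acc > acc or cost > other_cost)
            for other, (other_cost, other_acc) in results.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Illustrative placeholder numbers (total dollars, accuracy), not real data:
print(pareto_frontier({
    "gpt-4 (zero shot)": (20.0, 0.88),
    "warming": (30.0, 0.95),
    "complex agent": (3000.0, 0.94),
}))  # -> ['gpt-4 (zero shot)', 'warming']</code></pre>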
<p>There is no significant accuracy difference between the warming strategy and the best-performing agent architecture. Yet, Reflexion and LDB cost over 50% more than the warming strategy,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> and LATS over 50 times more (all these costs are entirely or predominantly from calls to GPT-4, so these ratios will be stable even if model costs change). Meanwhile, the escalation strategy strictly improves accuracy while costing less than half of LDB (GPT-3.5) at current inference prices.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a></p><p>Our results point to another underlying problem: papers making claims about the usefulness of agents have so far failed to test whether simple agent baselines can achieve similar accuracy. This has led to widespread beliefs among AI researchers that complex ideas like planning, reflection, and debugging are responsible for accuracy gains. In fact, Lipton and Steinhardt noted a trend in the AI literature of failing to identify the sources of empirical gains <a href="https://arxiv.org/abs/1807.03341">back in 2018</a>.</p><p>Based on our findings, the question of whether debugging, reflection, and other such &#8220;System 2&#8221; approaches are useful for code generation remains open.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> It is possible that they will be useful on harder programming tasks than those represented in HumanEval. For now, the over-optimism about System 2 approaches is exacerbated by a lack of reproducibility and standardization that we report below.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a></p><h4><strong>Proxies for cost are misleading</strong></h4><p>At first glance, reporting dollar costs is jarring. It breaks many properties of benchmarking that we take for granted: that measurements don&#8217;t change over time (whereas costs tend to come down) and that different models compete on a level playing field (whereas some developers may <a href="https://papers.nips.cc/paper_files/paper/2023/file/d1a14493e5f84d6c6129414f0cd1a7c6-Paper-Conference.pdf">benefit</a> from economies of scale, leading to lower inference costs). Because of this, researchers usually pick a different axis for the Pareto curve, such as parameter count.</p><p>The downsides of reporting costs are real, but we describe below how they can be mitigated. More importantly, we think using attributes like parameter count as a proxy for cost is a mistake and doesn&#8217;t solve the problem it&#8217;s intended to solve. To understand why, we need to introduce a conceptual distinction.</p><p>AI evaluations serve at least two distinct purposes. Model developers and AI researchers use them to identify which changes to the training data and architecture improve accuracy. We call this <a href="https://arxiv.org/pdf/2211.09110">model evaluation</a>. And downstream developers, such as programmers who use AI to build consumer-facing products, use evaluations to decide which AI systems to use in their products. We call this <a href="https://www.arthur.ai/product/bench">downstream evaluation</a>.</p><p>The difference between model evaluation and downstream evaluation is underappreciated.
This has led to much confusion about how to factor in the cost of running AI.</p><p>Model evaluation is a scientific question of interest to researchers. So it makes sense to stay away from dollar costs for the aforementioned reasons. Instead, controlling for compute is a reasonable approach: if we normalize the amount of compute used to train a model, we can then understand whether factors like architectural changes or changes in the data composition are responsible for improvements, as opposed to more compute. Notably, Nathan Lambert <a href="https://www.interconnects.ai/p/compute-efficient-open-llms">argues</a> that many of the accuracy gains in the last year (such as Meta's Llama 2) are simply consequences of using more compute.</p><p>On the other hand, downstream evaluation is an engineering question that helps inform a procurement decision. Here, cost is the actual construct of interest. The downsides of cost measurement aren&#8217;t downsides at all; they are exactly what&#8217;s needed. Inference costs do come down over time, and that greatly matters to downstream developers. It is unnecessary and counterproductive for the evaluation to stay frozen in time.</p><p>In this context, proxies for cost (such as the number of active parameters or amount of compute used) are misleading. For example, Mistral <a href="https://mistral.ai/news/mixtral-8x22b/">released</a> the figure below alongside their latest model, Mixtral 8x22B, to explain why developers should choose it over competitors.</p>
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!yu-D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F019ad4b8-e42b-4f9b-9d91-cb3ccc84fa3a_1030x640.jpeg" alt="Mistral's comparison of models by active parameters and benchmark performance"><figcaption class="image-caption"><em>Substituting active parameters as a proxy for cost is misleading. Source: <a href="https://mistral.ai/news/mixtral-8x22b/">Mistral</a>.</em></figcaption></figure></div><p>In this figure, the number of active parameters is a poor proxy for cost. On <a href="https://docs.endpoints.anyscale.com/pricing/">Anyscale</a>, Mixtral 8x7B costs twice as much as Llama 2 13B, yet Mistral's figure shows it costs about the same, because they only consider the number of active parameters.
Of course, downstream developers don't care about the number of active parameters when they're using an API. They simply care about the dollar cost relative to accuracy. Mistral chose &#8220;active parameters&#8221; as a proxy, presumably because it makes their models look better than dense models such as Meta&#8217;s Llama and Cohere&#8217;s Command R+. If we start using proxies for cost, every model developer can pick a proxy that makes their model look good.</p><p>Some hurdles to cost evaluation remain. Different providers can charge different amounts for <a href="https://docs.endpoints.anyscale.com/pricing/">the</a> <a href="https://www.together.ai/pricing">same</a> <a href="https://replicate.com/pricing">model</a>, the cost of an API call might change <a href="https://openai.com/blog/new-models-and-developer-products-announced-at-devday">overnight</a>, and cost might vary based on model developer decisions, such as whether <a href="https://twitter.com/OpenAIDevs/status/1779922566091522492?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1779922566091522492%7Ctwgr%5E100b538f87ce81b828a23634fa12d127c126b5ba%7Ctwcon%5Es1_c10&amp;ref_url=https%3A%2F%2Fwww.theverge.com%2F2024%2F4%2F15%2F24131401%2Fopenai-will-give-you-a-50-percent-discount-for-off-peak-gpt-use">bulk API</a> calls are charged differently. These downsides can be partly addressed by making evaluation results customizable, i.e., letting users plug in their provider's current input and output token prices to recalculate the tradeoff between cost and accuracy. In turn, downstream evaluations of agents should report input/output token counts in addition to dollar costs, so that anyone looking at the evaluation in the future can instantly recalculate the cost using current prices.</p>
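<p>A minimal sketch of such a recalculation, assuming an evaluation that reports per-run token counts; the price table is a hypothetical placeholder that a reader would fill in with their provider&#8217;s current rates.</p><pre><code># Sketch: recompute the dollar cost of a run from reported token counts.
# Prices are hypothetical placeholders, in dollars per million tokens.
PRICES = {
    "gpt-4": {"input": 30.0, "output": 60.0},
    "gpt-3.5-turbo": {"input": 0.5, "output": 1.5},
}

def run_cost(model, input_tokens, output_tokens, prices=PRICES):
    """Dollar cost of one run at the given per-million-token rates."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6

# Example: a run that consumed 1.2M input tokens and 150k output tokens.
print(run_cost("gpt-4", 1_200_000, 150_000))  # -> 45.0</code></pre><p>Because the recalculation is a one-liner over token counts, publishing those counts keeps a cost-controlled evaluation useful even as prices change.</p>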
<p>But ultimately, despite the hurdles, good measurement requires modeling the underlying construct of interest. For downstream evaluations, that underlying construct is cost. All other proxies are lacking.</p><h4><strong>Agent evaluations lack standardization and reproducibility</strong></h4><p>In the course of our evaluation, we found many shortcomings in the reproducibility and standardization of agent evaluations.</p><ul><li><p><strong>We were unable to reproduce the results of the LATS and LDB agents on HumanEval.</strong> In particular, across all five runs for LDB (Reflexion, GPT-3.5), the maximum accuracy was 91.5%, much lower than the 95.1% reported in the paper.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> The maximum accuracy of LATS across all five runs was similarly lower, at 91.5% instead of 94.4%.</p></li><li><p>Similarly, the accuracy for the baseline GPT-4 model reported in the LDB paper is drastically lower than our reproduction of the paper's code (75.0% vs. a mean of 89.6% across five runs). In fact, according to the paper, the GPT-3.5 and GPT-4 models perform very similarly (73.9% vs. 75.0%).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> Weak baselines could give a false sense of the amount of improvement attributable to the agent architecture.</p></li><li><p>The LATS agent was evaluated on only a <a href="https://github.com/andyz245/LanguageAgentTreeSearch/blob/554886901183a9908183d2cb104c3088c493650a/programming/mcts.py#L138C1-L138C29">subset</a> of the test cases provided in the HumanEval benchmark. This exaggerated its accuracy numbers: the code for a particular HumanEval problem might be incorrect, yet be marked as correct if it passes only the evaluated portion of that problem's test cases (see the sketch after this list). In our analysis, this was responsible for a 3% difference in accuracy (mean across five runs), which explains a substantial part of the difference between the accuracy we found and the one reported in the paper. In addition, many details about the implementation, such as hyperparameter values, were not reported in the paper or GitHub repository (see the <a href="https://www.cs.princeton.edu/~sayashk/papers/ai-leaderboards-appendix.pdf">appendix</a> for details).</p></li><li><p>To the best of our knowledge, this post is the first time the four agents with the highest accuracy&#8212;Retry, Warming, LDB (GPT-4), and LDB (GPT-4 + Reflexion)&#8212;have been tested on HumanEval.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a></p></li><li><p><strong>Reflexion, LDB, and LATS all use different subsets of HumanEval.</strong> Three (out of 164) coding problems in the original version of HumanEval lack example tests. Since these agents require example tests to debug or rerun their solutions, Reflexion removes the three problems that don't have example tests. LATS removes these three problems, plus another problem, for unreported reasons.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a> LDB adds example tests for the three problems that are missing in the original benchmark. <strong>None of the three papers reports this. </strong>The paper introducing LATS claims (incorrectly): "We use all 164 problems for our experiments."<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a> In our analysis, we conducted all evaluations on the version of the benchmark provided by LDB, since it contains example tests for all problems.</p></li><li><p>The LDB paper claims to use GPT-3.5 for code generation using Reflexion: "For Reflexion, we select the version based on GPT-3.5 and utilize the corresponding generated programs published in the official Github repository." However, the <a href="https://github.com/noahshinn/reflexion/blob/d15acda1c81d464d9a81648d7f29fb951e326c70/programming_runs/root/reflexion_humaneval_py_pass_at_1/reflexion_humaneval_py_pass_at_1.jsonl">generated program</a> they used from the Reflexion repository relies on GPT-4 for code generation, not GPT-3.5.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a></p></li></ul>
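<p>To see why grading on a subset of test cases (as in the LATS bullet above) inflates accuracy, consider a toy illustration; the problem and solution below are hypothetical, not taken from HumanEval.</p><pre><code># Toy illustration: grading on a subset of tests can mark buggy code correct.
# The "solution" is deliberately wrong for n = 0 (zero is even).

def buggy_is_even(n):
    return n % 2 == 0 if n != 0 else False  # bug: returns False for 0

tests = [(2, True), (3, False), (4, True), (0, True)]

def passes(solution, test_cases):
    return all(solution(x) == expected for x, expected in test_cases)

print(passes(buggy_is_even, tests[:3]))  # subset of tests -> True (marked correct)
print(passes(buggy_is_even, tests))      # full test suite -> False</code></pre>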
<p>These shortcomings in the empirical results have also led to errors of interpretation in broader discussions around the accuracy of AI agents. For example, a recent <a href="https://www.deeplearning.ai/the-batch/issue-241/">post</a> by Andrew Ng claimed that agents that use GPT-3.5 can outperform GPT-4. In particular, he claimed:</p><blockquote><p><em>[For HumanEval,] GPT-3.5 (zero shot) was 48.1% correct. GPT-4 (zero shot) does better at 67.0%. However, the improvement from GPT-3.5 to GPT-4 is dwarfed by incorporating an iterative agent workflow. Indeed, wrapped in an agent loop, GPT-3.5 achieves up to 95.1%.</em></p></blockquote><p>While this claim received a lot of attention, it is incorrect. The claim ("GPT-3.5 wrapped in an agent workflow achieves 95.1% accuracy") seems to be about the LDB agent. The <a href="https://paperswithcode.com/sota/code-generation-on-humaneval">Papers With Code leaderboard</a> for HumanEval makes the same claim. However, as we discussed above, for LDB, GPT-3.5 is only used to find bugs. The code is generated using GPT-4 (or the Reflexion agent that uses GPT-4), not GPT-3.5. Unfortunately, the error in the paper has led to much overoptimism about agents in the broader AI community.</p><p>Ng's post also makes the familiar error of repeating results from papers without verifying them or accounting for changes in prompts and model versions. For example, the zero-shot accuracy numbers of GPT-3.5 (48.1%) and GPT-4 (67.0%) seem to be copied from the <a href="https://arxiv.org/abs/2303.08774">GPT-4 technical report</a> from March 2023. However, the models have been updated many times since release. Indeed, in our comparison, we find that the base models perform much better than the claimed figures in Ng's post when we use them with the prompts provided with the LDB paper (GPT-3.5: 73.9%, GPT-4: 89.6%). As a result, the post drastically overestimates the improvement attributable to agent architectures.</p><p>Evaluation frameworks like Stanford's <a href="https://crfm.stanford.edu/helm/lite/latest/">HELM</a> and EleutherAI's <a href="https://www.eleuther.ai/projects/large-language-model-evaluation">LM Evaluation Harness</a> attempt to fix similar shortcomings for model evaluations by providing standardized evaluation results. We are working on solutions to make agent evaluations standardized and reproducible, especially from the perspective of downstream evaluation of agents.</p><p>Finally, downstream developers should keep in mind that HumanEval or any other standardized benchmark is nothing more than a rough proxy for the specific tasks that arise in a particular downstream application. To understand how agents will perform in practice, it is necessary to evaluate them on a <a href="https://arxiv.org/abs/2404.12272">custom dataset</a> from the domain of interest &#8212; or even better, to A/B test different agents in the production environment.</p><h4><strong>Further reading</strong></h4><ul><li><p><a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/">Zaharia et al.</a> observe that state-of-the-art accuracy on AI benchmarks is often attained by composite systems. If the adoption of agents continues, visualizing cost and accuracy as a Pareto curve will become even more necessary.</p></li><li><p><a href="https://arxiv.org/pdf/2212.01340.pdf">Santhanam et al.</a> point out the importance of evaluating cost alongside accuracy for information retrieval benchmarks.</p></li><li><p><a href="https://arxiv.org/html/2404.12387v1">Ormazabal et al.</a> highlight the <a href="https://arxiv.org/html/2404.12387v1#:~:text=Figure%201%3A,different%20LLM%20APIs.">accuracy vs.
cost per output token tradeoffs</a> for various models (but not agents) on MMLU. While the cost of output tokens might not be a good indicator of the overall cost, given the varying input token costs as well as output lengths for different models, it is better than not reporting the tradeoffs at all.&nbsp;</p></li><li><p>The <a href="https://gorilla.cs.berkeley.edu/leaderboard.html">Berkeley Function Calling leaderboard</a> includes various metrics for language model evaluations of function calling, including cost and latency.</p></li><li><p><a href="https://arxiv.org/pdf/2404.07972.pdf">Xie et al.</a> develop OSWorld, a benchmark for evaluating agents in computer environments. In their <a href="https://github.com/xlang-ai/OSWorld">GitHub repository</a> (though not in the paper), they give a rough cost estimate for running various multimodal agents on their benchmark.</p></li><li><p>Unsurprisingly, the main impetus for cost vs. accuracy tradeoffs has come from the <a href="https://x.com/swyx/status/1772799201023557697">downstream developers</a> who <a href="https://x.com/jaredpalmer/status/1783899239140986884">use AI</a>.</p></li><li><p>In a previous talk, we discussed <a href="https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/">three major pitfalls</a> in LLM evaluation: prompt sensitivity, construct validity, and contamination. The current research is largely orthogonal: prompt sensitivity isn&#8217;t a concern for agent evaluation (as agents are allowed to define their own prompts); downstream developers can address contamination and construct validity by evaluating on custom datasets.</p></li></ul><p>The code for reproducing our analysis is available <a href="https://github.com/benediktstroebl/agent-eval">here</a>. The <a href="https://www.cs.princeton.edu/~sayashk/papers/ai-leaderboards-appendix.pdf">appendix</a> includes more details about our setup and results.</p><h4><strong>Acknowledgments</strong></h4><p>We thank Rishi Bommasani, Rumman Chowdhury, Percy Liang, Shayne Longpre, Yifan Mai, Nitya Nadgir, Matt Salganik, Hailey Schoelkopf, Zachary Siegel, and Venia Veselovsky for discussions and inputs that informed our analysis. We acknowledge Cunxiang Wang and Ruoxi Ning for their prompt responses to our questions about the NovelQA benchmark.&nbsp;</p><p>We are grateful to the authors of the papers we engage with in this post for their quick responses and for sharing their code, which makes such reproduction analysis possible in the first place. In particular, we are grateful to Zilong Wang (LDB), Andy Zhou (LATS), and Karthik Narasimhan (Reflexion), who gave us feedback in response to an earlier draft of this blog post.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The leaderboard on the linked page lists <a href="https://paperswithcode.com/paper/agentcoder-multi-agent-based-code-generation">AgentCoder</a> as the most accurate system. However, the code or data for reproducing the results of this agent are not available online, so we do not consider it in this blog post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This post is about agents. 
Leaderboards are also becoming less useful for evaluating the <a href="https://www.arthur.ai/blog/whats-going-on-with-llm-leaderboards">underlying</a> <a href="https://aclanthology.org/2020.emnlp-main.393/">models</a>. There are many problems, including gameability. But controlling for inference cost isn&#8217;t the main problem, so our arguments don&#8217;t necessarily apply.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Tasks where increased compute could help indefinitely are primarily those where verifying whether a solution is correct is easy. In the case of programming questions, this takes the form of test cases provided with each question to check whether the answer is correct. Other examples include proving theorems, because verifying whether a theorem is correct can be straightforward, as well as <a href="https://arxiv.org/pdf/2404.06474.pdf">some tasks on the internet</a> for agents that navigate the web. That said, even for tasks where there is no way to guess a solution and then verify it, the costs of different agents can vary by orders of magnitude.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>We included agents from the HumanEval leaderboard on <a href="https://paperswithcode.com/sota/code-generation-on-humaneval">PapersWithCode</a> that share their code publicly. Reflexion is absent from the PapersWithCode list, but it has a reported accuracy of 91% (higher than any other agent with publicly available code apart from LDB and LATS), so we included it too.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>For the model evaluation, we used only the description of the coding problem as well as the example tests provided with the HumanEval dataset. Three of the 164 coding problems in HumanEval lack example tests. The authors of LDB include a modified version of HumanEval with example tests included for these three problems. We use this modified version for all experiments.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>In all of our baselines, when deciding whether to retry, we use only the example tests in the problem description, never the held-out test cases used to judge correctness, to avoid <a href="https://reproducible.cs.princeton.edu/">leakage</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>We evaluated Llama-3 using together.ai endpoints.
The <a href="https://api.together.xyz/models">cost per million tokens on together.ai</a>, for both prompt and completion, is $0.20 and $0.90 for Llama-3-8B and Llama-3-70B, respectively.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>This is also true for other desired properties of agents, such as running time. We report results for time vs. accuracy tradeoffs in the <a href="https://www.cs.princeton.edu/~sayashk/papers/ai-leaderboards-appendix.pdf">appendix</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>While some of the papers introducing these agents discuss cost abstractly, such as the relationship between cost and the number of times an agent retries, they don't report any concrete numbers on cost or compare token counts to a baseline.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>The cost comparison is for LDB (Reflexion, GPT-3.5), since that is the top-performing agent reported by the authors of LDB.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>In addition to HumanEval, we also ran experiments on the HotPotQA and NovelQA benchmarks for question answering. We found similar results for both benchmarks: large differences in cost can underlie small improvements in accuracy.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>One potential concern with our analysis is that while we relied on the April 2024 version of OpenAI models, many papers relied on older model versions for their results. To address this, we report results for an additional robustness check with the June 2023 version of OpenAI models in the <a href="https://www.cs.princeton.edu/~sayashk/papers/ai-leaderboards-appendix.pdf">appendix</a>; we find substantially similar results across model versions.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>While HumanEval is commonly used to evaluate how well AI can solve coding problems, it is limited due to its small size (only 164 questions), lack of difficult problems (none of the problems involve real-world tasks), and potential <a href="https://arxiv.org/pdf/2311.04850.pdf">contamination</a>, since language models have likely been trained on HumanEval problems, which might inflate the performance of the simple baselines we test.
A more rigorous examination of hypotheses related to whether System 2 thinking helps will likely require more comprehensive and robust benchmarks, such as <a href="https://www.swebench.com/">SWE-bench</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>LDB improves already-existing solutions by debugging them. The existing solutions can come from models like GPT-3.5 or GPT-4, or from agents like Reflexion. Since the authors of Reflexion provided all of the generated solutions in their GitHub repo, the authors of the LDB paper used code from the original Reflexion repository to run their analysis, rather than rerunning the Reflexion agent. The difference between the reported results and our reproduced results could be due to differences in the code generated by the Reflexion agent. Reusing Reflexion solutions is a reasonable choice for evaluating the usefulness of debugging (indeed, we see LDB increases the accuracy over using the models alone). The problem arises when their final accuracy is interpreted as a downstream evaluation, since it might give developers an inflated estimate of the accuracy of such techniques for coding.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>The authors acknowledge this and plan to update their results.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>The authors of LDB only tested the GPT-3.5 model as the debugger, which performed notably worse than the agent using GPT-4 as the debugger, with an accuracy of 88.9% for LDB (GPT-3.5 + Reflexion) vs. 92.9% for LDB (GPT-4 + Reflexion).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>In correspondence, the authors of LATS clarified: "Originally, there was an execution error when evaluating some test cases for [one of the HumanEval test cases], so we opted to remove it from our setting."</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>The authors acknowledge this and plan to update the paper to address it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>The authors acknowledge this and plan to update the paper to address it.</p></div></div>]]></content:encoded></item></channel></rss>