AI evaluation

New Paper: Towards a science of AI agent reliability

Quantifying the capability-reliability gap

Feb 24 • Sayash Kapoor and Arvind Narayanan

New paper: AI agents that matter

Rethinking AI agent benchmarking and evaluation

Jul 3, 2024 • Sayash Kapoor and Arvind Narayanan

Scientists should use AI as a tool, not an oracle

How AI hype leads to flawed research that fuels more hype

Jun 3, 2024 • Arvind Narayanan and Sayash Kapoor

AI leaderboards are no longer useful. It's time to switch to Pareto curves.

What spending $2,000 can tell us about evaluating AI agents

Apr 30, 2024 • Sayash Kapoor and Arvind Narayanan

Will AI transform law?

The hype is not supported by current evidence

Jan 24, 2024 • Arvind Narayanan and Sayash Kapoor

How Transparent Are Foundation Model Developers?

Introducing the Foundation Model Transparency Index

Oct 18, 2023 • Sayash Kapoor

Evaluating LLMs is a minefield

Annotated slides from a recent talk

Oct 4, 2023 • Arvind Narayanan and Sayash Kapoor

Does ChatGPT have a liberal bias?

A new paper making this claim has many flaws. But the question merits research

Aug 18, 2023 • Arvind Narayanan and Sayash Kapoor

Introducing the REFORMS checklist for ML-based science

ML-based science is in trouble. Clear reporting standards for researchers could help.

Aug 16, 2023 • Sayash Kapoor and Arvind Narayanan

Is GPT-4 getting worse over time?

A new paper going viral has been widely misinterpreted

Jul 19, 2023 • Arvind Narayanan and Sayash Kapoor

Quantifying ChatGPT’s gender bias

Benchmarks allow us to dig deeper into what causes biases and what can be done about it

Apr 26, 2023 • Sayash Kapoor and Arvind Narayanan

OpenAI’s policies hinder reproducible research on language models

LLMs have become privately-controlled research infrastructure

Mar 22, 2023 • Sayash Kapoor and Arvind Narayanan
