AI evaluation
New paper: AI agents that matter
Rethinking AI agent benchmarking and evaluation
Jul 3, 2024 • Sayash Kapoor and Arvind Narayanan
Scientists should use AI as a tool, not an oracle
How AI hype leads to flawed research that fuels more hype
Jun 3, 2024 • Arvind Narayanan and Sayash Kapoor
AI leaderboards are no longer useful. It's time to switch to Pareto curves.
What spending $2,000 can tell us about evaluating AI agents
Apr 30, 2024 • Sayash Kapoor and Arvind Narayanan
Will AI transform law?
The hype is not supported by current evidence
Jan 24, 2024 • Arvind Narayanan and Sayash Kapoor
How Transparent Are Foundation Model Developers?
Introducing the Foundation Model Transparency Index
Oct 18, 2023 • Sayash Kapoor
Evaluating LLMs is a minefield
Annotated slides from a recent talk
Oct 4, 2023 • Arvind Narayanan and Sayash Kapoor
Does ChatGPT have a liberal bias?
A new paper making this claim has many flaws. But the question merits research
Aug 18, 2023 • Arvind Narayanan and Sayash Kapoor
Introducing the REFORMS checklist for ML-based science
ML-based science is in trouble. Clear reporting standards for researchers could help.
Aug 16, 2023 • Sayash Kapoor and Arvind Narayanan
Is GPT-4 getting worse over time?
A new paper going viral has been widely misinterpreted
Jul 19, 2023 • Arvind Narayanan and Sayash Kapoor
Quantifying ChatGPT’s gender bias
Benchmarks allow us to dig deeper into what causes biases and what can be done about it
Apr 26, 2023 • Sayash Kapoor and Arvind Narayanan
OpenAI’s policies hinder reproducible research on language models
LLMs have become privately-controlled research infrastructure
Mar 22, 2023 • Sayash Kapoor and Arvind Narayanan
GPT-4 and professional benchmarks: the wrong answer to the wrong question
OpenAI may have tested on the training data. Besides, human benchmarks are meaningless for bots.
Mar 20, 2023 • Arvind Narayanan and Sayash Kapoor