Science of Agent Evaluation

The SAgE group advances the systematic study and evaluation of AI agents.

Princeton University · CITP · PLI · Princeton AI Lab

Key Projects & Publications

HAL (Holistic Agent Leaderboard)

The standardized, cost-aware, and third-party leaderboard for evaluating agents.

CORE-bench

A benchmark for evaluating agents on computational reproducibility, consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine).

AI Agents That Matter

Our analysis of current agent benchmarks and evaluation practices, highlighting several shortcomings that hinder their usefulness in real-world applications.

Inference Scaling fLaws

Paper on the limits of LLM resampling with imperfect verifiers. We analyze when inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model.
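
As a toy illustration (not the paper's model or numbers): suppose a weak model is correct on each sample with probability p, and a verifier accepts every correct answer but also accepts a fraction fpr of incorrect ones. Resampling until acceptance then raises accuracy only up to a ceiling of roughly p / (p + (1 - p) * fpr), no matter how many samples are drawn. The Python sketch below simulates this with made-up values for p and fpr.

```python
import random

def resampling_accuracy(p, fpr, k, trials=200_000):
    """Monte Carlo sketch: a model with per-sample accuracy `p` draws up to `k`
    samples and returns the first one accepted by a verifier that always accepts
    correct answers but also accepts a fraction `fpr` of incorrect ones.
    If nothing is accepted within k samples, the attempt counts as wrong.
    All parameter values here are hypothetical."""
    correct = 0
    for _ in range(trials):
        for _ in range(k):
            is_correct = random.random() < p
            if is_correct or random.random() < fpr:  # verifier accepts
                correct += is_correct
                break
    return correct / trials

p, fpr = 0.3, 0.2  # made-up weak-model accuracy and verifier false-positive rate
for k in (1, 4, 16, 256):
    print(f"k={k:>3}  accuracy ≈ {resampling_accuracy(p, fpr, k):.3f}")
# Accuracy rises with k but saturates near p / (p + (1 - p) * fpr) ≈ 0.68:
# the verifier's false positives cap how far resampling can take a weak model.
```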

Workshop on Useful and Reliable AI Agents

Workshop on how to make agents useful in the real world and at scale, with over 600 attendees and speakers from industry and academia.

Blog Posts

Is AI Progress Slowing Down?
Making sense of recent technology trends and claims
AI Snake Oil Blog, December 2024

Can AI automate computational reproducibility?
A new benchmark to measure the impact of AI on improving science
AI Snake Oil Blog, September 2024

New paper: AI agents that matter
Rethinking AI agent benchmarking and evaluation
AI Snake Oil Blog, July 2024

AI leaderboards are no longer useful. It's time to switch to Pareto curves.
What spending $2,000 can tell us about evaluating AI agents
AI Snake Oil Blog, April 2024
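
As a toy illustration of the Pareto-curve framing (the agents and numbers below are invented, not taken from the post or from HAL): a cost-aware comparison keeps every agent that no other agent beats on both accuracy and cost, rather than ranking by accuracy alone.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    name: str
    accuracy: float   # fraction of benchmark tasks solved
    cost_usd: float   # average inference cost per task, in dollars

def pareto_frontier(results):
    """Return the agents that are not dominated: an agent is dominated if some
    other agent is at least as cheap and at least as accurate, and strictly
    better on one of the two axes."""
    frontier = []
    for a in results:
        dominated = any(
            b.cost_usd <= a.cost_usd and b.accuracy >= a.accuracy
            and (b.cost_usd < a.cost_usd or b.accuracy > a.accuracy)
            for b in results
        )
        if not dominated:
            frontier.append(a)
    return sorted(frontier, key=lambda r: r.cost_usd)

# Hypothetical agents: once cost is on the other axis, "most accurate" is no
# longer the same as "best", and dominated agents drop off the curve.
results = [
    AgentResult("simple-baseline", accuracy=0.62, cost_usd=0.10),
    AgentResult("retry-5x",        accuracy=0.64, cost_usd=0.55),  # dominated
    AgentResult("tool-agent",      accuracy=0.71, cost_usd=0.40),
    AgentResult("frontier-agent",  accuracy=0.78, cost_usd=2.30),
]
for r in pareto_frontier(results):
    print(f"{r.name:16s} accuracy={r.accuracy:.2f}  cost=${r.cost_usd:.2f}")
```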

Media

AI Agents: Substance or Snake Oil with Arvind Narayanan - The TWIML AI Podcast

Discussion of the range of agentic behaviors, the challenges of benchmarking agents, and the 'capability and reliability gap' that creates risks when deploying AI agents in real-world applications.

AI agents that matter - Weaviate Podcast

Deep dive into the methodologies and challenges of evaluating AI agents, discussing the importance of reproducible benchmarks and cost-aware evaluation.

AI agent benchmarks are misleading, study warns - VentureBeat

Coverage of our work on agent evaluation methodologies and pitfalls.

Team

Collaborators

Amit Arora (Amazon)
Aymeric Roucher (Hugging Face)
Hailey Schoelkopf (Anthropic)
Harsh Trivedi (Stony Brook)
Iason Gabriel (Google DeepMind)
Jelena Luketina (UK AISI)
JJ Allaire (UK AISI)
Laura Weidinger (Google DeepMind)
Madhur Prashant (Amazon)
Marius Hobbhahn (Apollo Research)
Maximillian Kaufmann (UK AISI)
Morgan McGuire (Weights & Biases)
Omar Khattab (MIT)
Parth Asawa (UC Berkeley)
Rishi Bommasani (Stanford)
Shreya Shankar (UC Berkeley)
Shayne Longpre (MIT)
William Isaac (Google DeepMind)
Yifan Mai (Stanford)

Upcoming Work

Agent Zoo

We are studying the ability of AI agents to collectively self-improve towards a common goal.

Detecting errors in ML research

A benchmark for automatic detection of flaws in published ML research.

Agent Reliability

Understanding and improving the reliability of AI agents.