SAgE: Science of Agent Evaluation

The SAgE group advances the systematic study and evaluation of AI agents.

Princeton University · CITP · PLI · Princeton AI Lab

Key Projects & Publications

HAL: The Holistic Agent Leaderboard

Accepted to ICLR 2026

The standardized, cost-aware, and third-party leaderboard for evaluating agents.

CORE-bench

Published in TMLR 2025

The computational reproducibility agent benchmark: 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine).

AI Agents That Matter

Published in TMLR 2025

Our analysis of current agent benchmarks and evaluation practices, highlighting several shortcomings that hinder their usefulness in real-world applications.

Inference Scaling fLaws

Accepted to ICLR 2026

A paper on the limits of LLM resampling with imperfect verifiers. We analyze when inference scaling can enable weaker models to match the single-sample accuracy of a sufficiently strong model.
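
As a rough illustration of that limit (a toy simulation, not the paper's analysis), the sketch below assumes a weak model with a fixed single-sample accuracy and a verifier with assumed true-positive and false-positive rates; the hypothetical numbers show accepted-answer accuracy plateauing at a verifier-limited ceiling rather than approaching a strong model's accuracy.

```python
import random

def resample_accuracy(p_correct, verifier_tpr, verifier_fpr, k, trials=100_000):
    """Accuracy when a weak model draws up to k samples and keeps the first
    one an imperfect verifier accepts (falling back to the last draw).

    p_correct    -- assumed single-sample accuracy of the weak model
    verifier_tpr -- assumed P(accept | answer correct)
    verifier_fpr -- assumed P(accept | answer wrong)
    """
    wins = 0
    for _ in range(trials):
        for _ in range(k):
            correct = random.random() < p_correct
            accept_prob = verifier_tpr if correct else verifier_fpr
            if random.random() < accept_prob:
                break
        wins += correct  # the accepted sample, or the last one drawn
    return wins / trials

# With a 10% false-positive rate, extra samples help at first, but the
# accuracy of accepted answers plateaus at a verifier-limited ceiling
# (about 0.8 with these made-up numbers) instead of approaching 1.0.
for k in (1, 4, 16, 64):
    print(f"k={k:2d}  accuracy≈{resample_accuracy(0.3, 0.95, 0.10, k):.3f}")
```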

Workshop on Useful and Reliable AI Agents

A workshop on making agents useful in the real world and at scale, with over 600 attendees and speakers from industry and academia.

Agent Reliability

Understanding and improving the reliability of AI agents.

Blog Posts

AI As Normal Technology Blog

Is AI Progress Slowing Down?

Making sense of recent technology trends and claims

December 2024

AI As Normal Technology Blog

Can AI automate computational reproducibility?

A new benchmark to measure the impact of AI on improving science

September 2024

AI As Normal Technology Blog

New paper: AI agents that matter

Rethinking AI agent benchmarking and evaluation

July 2024

AI As Normal Technology Blog

AI leaderboards are no longer useful. It's time to switch to Pareto curves.

What spending $2,000 can tell us about evaluating AI agents

April 2024
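
To make the Pareto-curve idea from that post concrete, here is a minimal sketch of computing a cost-accuracy Pareto frontier from per-agent results. The agent names, costs, and scores are made up for illustration, and this is not the HAL implementation.

```python
def pareto_frontier(results):
    """Return the agents not dominated on (cost, accuracy).

    results -- list of (name, cost_usd, accuracy) tuples (illustrative values).
    An agent is dominated if another is at least as accurate for no more cost.
    """
    frontier = []
    for name, cost, acc in sorted(results, key=lambda r: (r[1], -r[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

# Hypothetical agents: cheap-but-weaker and pricey-but-stronger agents can both
# sit on the frontier, while agents that cost more without being more accurate
# drop out -- a trade-off that a single-number leaderboard ranking hides.
agents = [
    ("agent-a", 1.20, 0.42),
    ("agent-b", 3.50, 0.41),   # dominated: costs more, less accurate
    ("agent-c", 6.00, 0.58),
    ("agent-d", 14.0, 0.61),
]
print(pareto_frontier(agents))
```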

Media

AI Agents: Substance or Snake Oil with Arvind Narayanan - TWIML AI Podcast

Discussion on the range of agentic behaviors, the challenges in benchmarking agents, and the 'capability and reliability gap', which creates risks when deploying AI agents in real-world applications.

AI agents that matter - Weaviate Podcast

Deep dive into the methodologies and challenges of evaluating AI agents, discussing the importance of reproducible benchmarks and cost-aware evaluation.

AI agent benchmarks are misleading, study warns - VentureBeat

Coverage of our work on agent evaluation methodologies and pitfalls.

How to build a better AI benchmark - MIT Technology Review

Review of our work on improving AI agent evaluation and benchmarking practices.

Is AI hitting a wall? - Financial Times

Discussion on AI progress and evaluation challenges.

Developers Say GPT-5 Is a Mixed Bag - WIRED

Analysis of GPT-5's performance on coding and software engineering tasks.

Team

Andrew Schwartz

Kangheng Liu

Peter Kirgis

Saiteja Utpala

Stephan Rabanser

Alumni

Collaborators

Amit Arora

Botao Yu

Boyi Wei

Daniel Kang

Dawn Song

Dheeraj Oruganty

Dongyoon Hahm

Felix Chen

Franck Ndzomga

Harsh Trivedi

Huan Sun

Juyong Lee

Percy Liang

Peter Henderson

Rishi Bommasani

Sophie Luskin

Tengjun Jin

Tianci Xue

Yifan Mai

Yifei Zhou

Yu Su

Yuxuan Zhu

Ziru Chen

Funding

Funded by Coefficient Giving, Schmidt Sciences, the Princeton AI Lab, the Princeton Language and Intelligence Initiative, and the Princeton Catalysis Initiative. We are grateful to OpenAI and Google for providing API credits to evaluate their models.