The SAgE group works on advancing the systematic study and evaluation of AI agents.
A survey of an emerging paradigm for evaluating frontier AI through long-horizon, real-world tasks that complement traditional benchmarks.
A project for regularly conducting open-world evaluations to track AI capabilities. In CRUX #1, we tasked an agent with building and publishing an iOS app to the App Store.
Forthcoming in ICML 2026
Understanding and improving the reliability of AI agents.
Published in ICLR 2026
The standardized, cost-aware, and third-party leaderboard for evaluating agents.
Agent benchmarks that report only pass/fail outcomes lose critical information; systematic log analysis is necessary to assess internal validity, external validity, and safety.
Published in TMLR 2025
The computational reproducibility agent benchmark. It consists of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine).
Published in TMLR 2025
Our analysis of current agent benchmarks and evaluation practices, highlighting several shortcomings that hinder their usefulness in real-world applications.
Accepted to ICLR 2026
Paper on the limits of LLM resampling with imperfect verifiers. We analyze when inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model.
Why we need evaluations beyond benchmarks, and how to do them well
Defining and measuring the capability-reliability gap in AI agents
Why AGI does not represent a discontinuity in AI capabilities or impacts
Making sense of recent technology trends and claims
A new benchmark to measure the impact of AI on improving science
Rethinking AI agent benchmarking and evaluation
What spending $2,000 can tell us about evaluating AI agents
Coverage of our research on the capability-reliability gap in AI agents.
Keynote on the state of AI agent evaluation and where the field needs to go.
Discussion on agent benchmarking challenges and the gap between agent hype and real-world performance.
Discussion on the range of agentic behaviors, the challenges in benchmarking agents, and the 'capability-reliability gap', which creates risks when deploying AI agents in real-world applications.
Deep dive into the methodologies and challenges of evaluating AI agents, discussing the importance of reproducible benchmarks and cost-aware evaluation.
Coverage of our work on agent evaluation methodologies and pitfalls.
Review of our work on improving AI agent evaluation and benchmarking practices.
Discussion on AI progress and evaluation challenges.
Analysis of GPT-5's performance on coding and software engineering tasks.
Discussion on developing richer composite metrics for reliability in AI agent evaluation.
Funded by Coefficient Giving, Schmidt Sciences, the Princeton AI Lab, the Princeton Language and Intelligence Initiative, and the Princeton Catalysis Initiative. We are grateful to OpenAI and Google for providing API credits to evaluate their models.