The SAgE group advances the systematic study and evaluation of AI agents.
Accepted to ICLR 2026 ↗
The standardized, cost-aware, and third-party leaderboard for evaluating agents.
Published in TMLR 2025 ↗
The computational reproducibility agent benchmark. It consists of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine).
Published in TMLR 2025 ↗
Our analysis of current agent benchmarks and evaluation practices, highlighting several shortcomings that hinder their usefulness for real-world applications.
Accepted to ICLR 2026 ↗
Paper on the limits of LLM resampling with imperfect verifiers. We analyze when inference scaling can enable weaker models to match the single-sample accuracy of a sufficiently strong model.
Workshop on making agents useful in the real world and at scale, with over 600 attendees and speakers from industry and academia.
Making sense of recent technology trends and claims
A new benchmark to measure the impact of AI on improving science
Rethinking AI agent benchmarking and evaluation
What spending $2,000 can tell us about evaluating AI agents
Discussion of the range of agentic behaviors, the challenges of benchmarking agents, and the 'capability and reliability gap', which creates risks when deploying AI agents in real-world applications.
Deep dive into the methodologies and challenges of evaluating AI agents, discussing the importance of reproducible benchmarks and cost-aware evaluation.
Coverage of our work on agent evaluation methodologies and pitfalls.
Review of our work on improving AI agent evaluation and benchmarking practices.
Discussion of AI progress and evaluation challenges.
Analysis of GPT-5's performance on coding and software engineering tasks.
Funded by Coefficient Giving, Schmidt Sciences, the Princeton AI Lab, the Princeton Language and Intelligence Initiative, and the Princeton Catalysis Initiative. We are grateful to OpenAI and Google for providing API credits to evaluate their models.