The SAgE group advances the systematic study and evaluation of AI agents.
A standardized, cost-aware, third-party leaderboard for evaluating agents.
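As a rough illustration of what cost-aware comparison can involve (a sketch under assumed data, not this leaderboard's actual implementation), the snippet below keeps only the agents that are Pareto-optimal on accuracy versus average cost per task. All agent names and numbers are hypothetical.

```python
# Minimal sketch of cost-aware comparison: keep only agents that are
# Pareto-optimal on (accuracy, cost). Data is made up for illustration.
from typing import NamedTuple

class AgentResult(NamedTuple):
    name: str
    accuracy: float  # fraction of tasks solved
    cost: float      # average USD per task

def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """An agent is on the frontier if no other agent is at least as
    accurate and strictly cheaper, or strictly more accurate at the
    same or lower cost."""
    frontier = []
    for r in results:
        dominated = any(
            (o.accuracy >= r.accuracy and o.cost < r.cost)
            or (o.accuracy > r.accuracy and o.cost <= r.cost)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost)

results = [
    AgentResult("agent-a", 0.62, 1.40),
    AgentResult("agent-b", 0.58, 0.30),
    AgentResult("agent-c", 0.62, 3.10),
    AgentResult("agent-d", 0.45, 0.25),
]
print(pareto_frontier(results))
# -> agent-d, agent-b, agent-a; agent-c is dominated (same accuracy as
#    agent-a but at more than twice the cost)
```

Reporting the frontier rather than a single accuracy column makes it explicit when a small accuracy gain comes at a large increase in cost.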
A benchmark for evaluating agents on computational reproducibility, consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine).
Our analysis of current agent benchmarks and evaluation practices, highlighting several shortcomings that hinder their usefulness in real-world applications.
Paper on the limits of LLM resampling with imperfect verifiers. We analyze when inference scaling can enable weaker models to match the single-sample accuracy of a sufficiently strong model.
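To make the claim concrete, here is a minimal simulation sketch (illustrative only; the function `resample_accuracy` and every numeric parameter are assumptions, not values from the paper). A weaker model is resampled under a verifier with a nonzero false-positive rate, and its accuracy saturates at a ceiling set by the verifier's precision, which can sit below a stronger model's single-sample accuracy.

```python
# Illustrative simulation (not from the paper): resampling a weaker model
# under an imperfect verifier and checking whether its accuracy can reach
# a stronger model's single-sample accuracy.
import random

def resample_accuracy(p_correct, fp_rate, k, trials=100_000, seed=0):
    """Accuracy when up to k samples are drawn and the first one the
    verifier accepts is returned. The verifier always accepts correct
    answers and wrongly accepts incorrect ones with probability fp_rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        for _ in range(k):
            correct = rng.random() < p_correct
            accepted = correct or rng.random() < fp_rate
            if accepted:
                wins += correct
                break
    return wins / trials

# Hypothetical numbers: the weak model solves 30% of tasks per sample, the
# verifier has a 20% false-positive rate, and the strong model solves 80%
# of tasks in a single sample.
for k in (1, 4, 16, 64):
    print(k, round(resample_accuracy(0.30, 0.20, k), 3))
# Accuracy saturates near p / (p + (1 - p) * fp_rate) ≈ 0.68 here, below
# the strong model's 0.80: verifier precision caps what resampling can buy.
```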
Making sense of recent technology trends and claims
A new benchmark to measure the impact of AI on improving science
Rethinking AI agent benchmarking and evaluation
What spending $2,000 can tell us about evaluating AI agents
Discussion of the range of agentic behaviors, the challenges of benchmarking agents, and the 'capability and reliability gap', which creates risks when deploying AI agents in real-world applications.
Deep dive into the methodologies and challenges of evaluating AI agents, discussing the importance of reproducible benchmarks and cost-aware evaluation.
Coverage of our work on agent evaluation methodologies and pitfalls.
Amazon
Hugging Face
Anthropic
Stony Brook
Google DeepMind
UK AISI
UK AISI
Google DeepMind
Amazon
Apollo Research
UK AISI
Weights & Biases
MIT
UC Berkeley
Stanford
UC Berkeley
MIT
Google DeepMind
Stanford
We are studying the ability of AI agents to collectively self-improve towards a common goal.
A benchmark for automatic detection of flaws in published ML research
Understanding and improving the reliability of AI agents.