SAgE Research Group - Science of Agent Evaluation

Key Projects & Publications

Open-World Evaluations

A survey of an emerging paradigm for evaluating frontier AI through long-horizon, real-world tasks that complement traditional benchmarks.

Paper

CRUX: Collaborative Research for Updating AI eXpectations

Accepted at ICML 2026 workshops ↗

A project for regularly conducting open-world evaluations to track AI capabilities. In CRUX #1, we tasked an agent with building and publishing an iOS app to the App Store.

Website · CRUX #1

Agent Reliability

Forthcoming in ICML 2026 ↗ · Accepted at ICML 2026 workshops ↗

Understanding and improving the reliability of AI agents.

Website · Paper

HAL: The Holistic Agent Leaderboard

Published in ICLR 2026 ↗

The standardized, cost-aware, and third-party leaderboard for evaluating agents.

Leaderboards · Harness · Paper

Log Analysis for Credible Agent Evaluation

Accepted at ICML 2026 workshops ↗

Agent benchmarks that report only pass/fail outcomes lose critical information; systematic log analysis is necessary to assess internal validity, external validity, and safety.

Paper

CORE-bench

Published in TMLR 2025 ↗

The computational reproducibility agent benchmark. It consists of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine).

Code · Paper

AI Agents That Matter

Published in TMLR 2025 ↗

An analysis of current agent benchmarks and evaluation practices, highlighting several shortcomings that hinder their usefulness in real-world applications.

Paper

Inference Scaling fLaws

Accepted to ICLR 2026 ↗

A paper on the limits of LLM resampling with imperfect verifiers. We analyze when inference scaling lets weaker models match the single-sample accuracy of a sufficiently strong model.

Paper
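
To make the resampling question concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: samples are independent, the verifier accepts a correct answer with a fixed true-positive rate and a wrong answer with a fixed false-positive rate, and the first accepted sample is returned. The function names and the numbers in the usage note are hypothetical and chosen only for illustration.

```python
def accuracy_after_resampling(p_correct, true_pos_rate, false_pos_rate, max_samples):
    """Probability that the returned answer is correct when up to `max_samples`
    independent samples are drawn and the first verifier-accepted one is returned.
    If no sample is accepted, the attempt counts as a failure."""
    accept_correct = p_correct * true_pos_rate            # sample is correct and accepted
    accept_wrong = (1.0 - p_correct) * false_pos_rate     # sample is wrong but accepted
    accept_any = accept_correct + accept_wrong
    if accept_any == 0.0:
        return 0.0
    # P(some sample is accepted within the budget) * P(correct | accepted)
    p_some_accept = 1.0 - (1.0 - accept_any) ** max_samples
    return p_some_accept * accept_correct / accept_any


def resampling_ceiling(p_correct, true_pos_rate, false_pos_rate):
    """Limit of the accuracy above as the sample budget grows without bound."""
    accept_correct = p_correct * true_pos_rate
    accept_wrong = (1.0 - p_correct) * false_pos_rate
    total = accept_correct + accept_wrong
    return accept_correct / total if total > 0.0 else 0.0
```

Under these assumptions, a hypothetical weaker model with single-sample accuracy 0.3, paired with a verifier that has a true-positive rate of 1.0 and a false-positive rate of 0.1, has a ceiling of 0.3 / 0.37, roughly 0.81, no matter how many samples are drawn. It therefore could not match a stronger model whose single-sample accuracy is 0.9. With a false-positive rate of 0, the ceiling rises to 1.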