We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without: AI makes them slower. We view this result as a snapshot of early-2025 AI capabilities in one relevant setting; as these systems continue to rapidly evolve, we plan on continuing to use this methodology to help estimate AI ...
We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s o1-preview, including full transcripts of all runs.
[Figure: an illustration of a sequence of events where rogue replicating agents emerge and cause harm.]
METR’s Autonomy Evaluation Resources: resources for testing dangerous autonomous capabilities in frontier models.