We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s o1-preview, including full transcripts of all runs. (metr.org)
This is Part 0 of a four-part report; see links to Part 1, Part 2, Part 3, and a folder with more materials. Abstract: In the next few decades we may develop AI that can automate ~all cognitive tasks and dramatically transform the world. By contrast, today the capabilities and impact of AI are much […] (Open Philanthropy)