CAIS and Scale AI are excited to announce the launch of Humanity's Last Exam, a project aimed at measuring how close we are to achieving expert-level AI systems. The exam aims to build the world's most difficult public AI benchmark by gathering experts across all fields. People who submit successful questions will be invited as coauthors on the dataset paper and have a chance to win money from a $500,000 prize pool.| Center for AI Safety
If a scientific finding is really true and important, we should be able to reproduce it: different researchers can investigate and confirm it, rather than just taking one researcher at their word. …| Economist Writing Every Day
The Systematizing Confidence in Open Research and Evidence (SCORE) project is an attempt to replicate hundreds of social science papers, and to search for patterns that predict what types of papers…| Economist Writing Every Day
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad...| arXiv.org
Anthropic's new research initiative exploring AI's impact on the future of work and the economy, and developing policy frameworks for a changing workforce.| www.anthropic.com
In 2023, we gathered the data for what became “ChatGPT Hallucinates Nonexistent Citations: Evidence from Economics.” Since then, LLM use has increased. A 2025 survey from Elon University…| Economist Writing Every Day