Anthropic's latest interpretability research: a new microscope to understand Claude's internal mechanisms | www.anthropic.com
Today, we’re announcing Claude 3.7 Sonnet, our most intelligent model to date and the first hybrid reasoning model generally available on the market. | www.anthropic.com
Alignment Is Not All You Need: Other Problems in AI Safety | adamjones.me
The AI regulator’s toolbox: A list of concrete AI governance practices | adamjones.me
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for steering large language models (LLMs) toward desired behaviours. However, relying on simple human feedback doesn’t work for tasks that are too complex for humans to accurately judge at the scale needed to train AI models. Scalable oversight techniques attempt to address this […] | BlueDot Impact
This article explains key concepts that come up in the context of AI alignment. These terms are only attempts at gesturing at the underlying ideas, and the ideas are what is important. There is no strict consensus on which name should correspond to which idea, and different people use the terms differently.[1] This article explains […] | BlueDot Impact