Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for steering large language models (LLMs) toward desired behaviours. However, relying on simple human feedback doesn’t work for tasks that are too complex for humans to accurately judge at the scale needed to train AI models. Scalable oversight techniques attempt to address this […] | BlueDot Impact
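As a concrete illustration of what the "simple human feedback" in the excerpt above usually means in practice, here is a minimal sketch (my own assumption, not taken from the quoted source) of the reward-model step of RLHF: a small model trained on pairwise human preference labels with a Bradley–Terry loss. The class names, embedding dimension, and hyperparameters are illustrative placeholders.

```python
# Sketch of the "human feedback" step in RLHF: a reward model is trained on
# pairwise comparisons (which of two responses a human preferred).
# All names, shapes, and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Scores a response embedding; higher means judged better by humans."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)


def preference_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximise P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    return -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()


# Toy training step on random "embeddings" standing in for model responses.
rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)

opt.zero_grad()
loss = preference_loss(rm, chosen, rejected)
loss.backward()
opt.step()
```

The scalability problem the excerpt points to lives in the labels: every (chosen, rejected) pair needs a human who can actually tell which response is better, which stops working once responses become too complex for humans to judge reliably.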
If we can accurately recognize good performance on alignment, we could elicit lots of useful alignment work from our models, even if they're playing the training game. | Planned Obsolescence
Far fewer people are working on it than you might think, and even the alignment research that is happening is very much not on track. (But it’s a solvable problem, if we get our act together.) | FOR OUR POSTERITY