Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? … | arXiv.org
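As a concrete illustration of the weak-to-strong setup described in that abstract, here is a toy sketch (not code from the paper): small scikit-learn classifiers stand in for the weak supervisor and the strong student, and the "performance gap recovered" (PGR) metric compares the student trained on weak labels against a ceiling trained on ground truth.

```python
# Toy weak-to-strong setup (illustrative only): a low-capacity "weak supervisor"
# is trained on ground truth, its imperfect labels train a higher-capacity
# "strong student", and we measure how much of the gap to a ground-truth-trained
# "strong ceiling" the student recovers (PGR).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=6000, n_features=40, n_informative=10,
                           flip_y=0.05, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

weak = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_sup, y_sup)
weak_labels = weak.predict(X_train)                        # weak supervision, no ground truth

student = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)
ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

acc_weak, acc_student, acc_ceiling = (m.score(X_test, y_test) for m in (weak, student, ceiling))
pgr = (acc_student - acc_weak) / (acc_ceiling - acc_weak)  # performance gap recovered
print(f"weak={acc_weak:.3f}  student={acc_student:.3f}  ceiling={acc_ceiling:.3f}  PGR={pgr:.2f}")
```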
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. … | arXiv.org
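A rough sketch of that supervised critique-and-revision loop, for intuition only (not the paper's implementation): `generate` is a placeholder for whatever sampling call a model exposes, and `CONSTITUTION` is a stand-in list of principles.

```python
# Illustrative sketch of the Constitutional AI supervised phase, assuming a
# user-supplied `generate(prompt) -> str` sampling function.
from typing import Callable, List

CONSTITUTION: List[str] = [
    "Identify specific ways the response is harmful, unethical, or dishonest.",
    "Point out anything that could encourage dangerous or illegal activity.",
]

def critique_and_revise(prompt: str, generate: Callable[[str], str],
                        principles: List[str] = CONSTITUTION) -> str:
    """Sample an initial response, then critique and revise it against each
    principle; the final revision becomes supervised finetuning data."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"{prompt}\n\nResponse: {response}\n\nCritique request: {principle}\n\nCritique:")
        response = generate(
            f"{prompt}\n\nResponse: {response}\n\nCritique: {critique}\n\n"
            "Revision request: Rewrite the response to address the critique.\n\nRevision:")
    return response  # finetune the original model on (prompt, revision) pairs
```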
Thanks to Roger Grosse, Cem Anil, Sam Bowman, Tamera Lanham, and Mrinank Sharma for helpful discussion and comments on drafts of this post. … | www.alignmentforum.org
Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. … | arXiv.org
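For intuition, here is a small probe in the spirit of that unsupervised method (not the authors' code): given hidden states for the "yes" and "no" completions of each statement, it learns a linear probe whose two probabilities are consistent with each other and not degenerate; a statement's truth score can then be read off as the average of p_pos and 1 - p_neg.

```python
# Sketch of a contrast-consistent probe, assuming precomputed activations
# h_pos, h_neg of shape (n, d) for the two halves of each contrast pair.
import torch

def train_ccs_probe(h_pos: torch.Tensor, h_neg: torch.Tensor,
                    steps: int = 1000, lr: float = 1e-3) -> torch.nn.Module:
    probe = torch.nn.Sequential(torch.nn.Linear(h_pos.shape[1], 1), torch.nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        p_pos = probe(h_pos).squeeze(-1)
        p_neg = probe(h_neg).squeeze(-1)
        consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()     # answers should mirror each other
        confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()  # avoid the trivial p = 0.5 solution
        loss = consistency + confidence
        opt.zero_grad(); loss.backward(); opt.step()
    return probe
```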
A high-level view on how this new approach fits into our alignment plans | aligned.substack.com
Bootstrapping a solution to the alignment problem | aligned.substack.com
How to scale alignment techniques to hard tasks | aligned.substack.com