From:
www.alignmentforum.org
Why Do Some Language Models Fake Alignment While Others Don't? — AI Alignment Forum
https://www.alignmentforum.org/posts/ghESoA8mo3fv9Yx3E/why-do-some-language-models-fake-alignment-while-others-don
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduc…