Here are two problems you’ll face if you’re an AI company building and using powerful AI: … (www.alignmentforum.org)
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduc… (www.alignmentforum.org)
Abstract: In this paper, LLMs are tasked with completing an impossible quiz while they are in a sandbox, monitored, told about these measures and ins… (www.alignmentforum.org)
An agent is corrigible when it robustly acts opposite to the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes. (www.alignmentforum.org)
Imagine the space of goals as a two-dimensional map, where simple goals take up more area. Some goals are unstable, in that an agent with such a goal will naturally come to have a different goal. We can arrange goal-space... (www.alignmentforum.org)
In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they appl… (www.alignmentforum.org)