In Corrigibility Can Be VNM-Incoherent, I operationalized an agent's corrigibility as our ability to modify the agent so that it follows different po… | www.alignmentforum.org
Intuitions can only get us so far in understanding corrigibility. Let’s dive into some actual math! | www.alignmentforum.org
(Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.) I have long said that the lion’s share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather […] | Machine Intelligence Research Institute
An agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes. | www.alignmentforum.org
Imagine the space of goals as a two-dimensional map, where simple goals take up more area. Some goals are unstable, in that an agent with that goal will naturally change into having a different goal. We can arrange goal-space... | www.alignmentforum.org
I’ll attempt to build up details around what I mean by “corrigibility” through small stories about a purely corrigible agent whom I’ll call Cora, and her principal, who I’ll name Prince. | www.alignmentforum.org