When using a fuzzy, intuitive approach, it’s easy to gloss over issues by imagining that a corrigible AGI will behave like a helpful, human servant. By using a sharper, more mathematical frame, we can more precisely investigate where corrigibility may have problems, such as by testing whether a purely corrigible agent behaves nicely in toy settings.| www.alignmentforum.org
Imagine the space of goals as a two-dimensional map, where simple goals take up more area. Some goals are unstable, in that an agent with that goal will naturally change into having a different goal. We can arrange goal-space...| www.alignmentforum.org
I’ll attempt to build up details around what I mean by “corrigibility” through small stories about a purely corrigible agent whom I’ll call Cora, and her principal, whom I’ll name Prince.| www.alignmentforum.org