When using a fuzzy, intuitive approach, it’s easy to gloss-over issues by imagining that a corrigible AGI will behave like a helpful, human servant. By using a sharper, more mathematical frame, we can more precisely investigate where corrigibility may have problems, such as by testing whether a purely corrigible agent behaves nicely in toy-settings.| www.alignmentforum.org
An agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.| www.alignmentforum.org