Week 3 of the AI alignment curriculum. Goal misgeneralization refers to scenarios in which an agent, placed in a new situation, generalizes to behaving in a competent yet undesirable way because it learned the wrong goal during training.

Goal Misgeneralisation: Why Correct Specifications Aren't Enough For Correct Goals (Shah, 2022)
Blog post
A correct specification is needed to give the learner the right context (so it doesn't exploit bugs), but it doesn't automatically result in correct goals
If ...
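A toy sketch of the mechanism behind the definition above (not from the post; the setup and feature names here are invented for illustration): during training, a spurious proxy feature happens to track the intended goal perfectly, so a learner that latches onto the proxy fits the training data flawlessly. At deployment the correlation breaks, and the model keeps competently predicting the proxy rather than the goal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Training distribution: the intended feature x0 is a noisy (90%
# reliable) signal of the label y, while the proxy x1 matches y
# exactly -- so the cheapest fit is to rely entirely on the proxy.
y = rng.integers(0, 2, n)
x0 = np.where(rng.random(n) < 0.9, y, 1 - y)   # intended goal signal
x1 = y.copy()                                  # spurious proxy, perfect in training
X = np.column_stack([x0, x1]).astype(float)

# Ordinary least-squares fit; nothing pushes the learner toward x0,
# so essentially all weight lands on the proxy x1.
w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)

# Deployment distribution: the proxy decouples from the label.
y_t = rng.integers(0, 2, n)
x0_t = np.where(rng.random(n) < 0.9, y_t, 1 - y_t)  # still informative
x1_t = rng.integers(0, 2, n)                         # proxy now pure noise
X_t = np.column_stack([x0_t, x1_t]).astype(float)

train_acc = (((X @ w) > 0.5) == y).mean()
test_acc = (((X_t @ w) > 0.5) == y_t).mean()
print(f"train acc {train_acc:.2f}, test acc {test_acc:.2f}")
```

The model is not confused at deployment: it keeps doing exactly what it learned (track x1), which is the sense in which the behavior stays competent while the goal it pursues is the wrong one.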