Week 2 of the AI alignment curriculum.

Reward misspecification occurs when RL agents are rewarded for misbehaving, that is, for behaviour their designers did not intend.

Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020)

Specification gaming is behaviour that satisfies the literal specification of an objective without achieving the intended outcome.

Evil genies

Specification gaming amounts to both the old and the new understanding of "hacking".

RLHF (reinforcement learning from human feedback) can help, but only if the correct reward function is learned.

The map is not the territory; agents l...