Recently, I was reading this paper, which demonstrates how to do online RLHF for alignment of LLMs, and a sentence stuck out to me: "We conjecture that this is because the reward model (discriminator) usually generalizes better than the policy (generator)." This is an offhand remark, but it strikes at...