Epistemic status: Early thoughts. Some ideas, but no empirical testing or validation yet.

I’ve started thinking a fair bit about reward hacking recently, because frontier models are reportedly beginning to show signs of it, especially on coding tasks. Thus, the era of easy-to-align pretraining-only models appears...