If we’ve decided we’re collectively fine with unleashing millions of spam bots, then the least we can do is actually study what they can – and can’t – do.| Planned Obsolescence
If we can accurately recognize good performance on alignment, we could elicit lots of useful alignment work from our models, even if they're playing the training game.| Planned Obsolescence
We're creating incentives for AI systems to make their behavior look as desirable as possible, while intentionally disregarding human intent when that conflicts with maximizing reward.| Planned Obsolescence