I split this off from clickbait bandits for discoverability, and because it has grown larger than its source notebook. Since the advent of the LLM era, the term human reward hacking has become salient. This is because we fine-tune lots of LLMs using reinforcement learning, and RL algorithms are notoriously prone to “cheating” in a manner we interpret as “reward hacking”. Things I have been reading on this theme: Benton et al. (2024), Greenblatt et al. (2024), Laine et al. (2...
Reinforcement learning meets iterated game theory meets theory of mind
Fine tuning foundation models
Learning agents in a multi-agent system which account for, and/or exploit, the fact that other agents are learning too. This is one way of formalising the idea of theory of mind. Learning with theory of mind works out nicely for reinforcement learning, e.g. in opponent shaping, and may be an important tool for understanding AI agency and AI alignment, as well as for aligning more general human systems. Other interesting things might arise from a good theory of other-aware learning, such ...
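To make the opponent-shaping idea concrete, here is a minimal sketch in JAX, in the spirit of LOLA-style updates on a matching-pennies matrix game. The game, the single-logit policies, the step sizes and the function names are my own illustrative choices, not any paper’s reference implementation; the only point is the mechanism, namely differentiating through the opponent’s anticipated learning step.

```python
import jax
import jax.numpy as jnp

# Matching pennies: player A gets +1 when actions match, -1 otherwise;
# player B's payoff is the negative (zero-sum).
PAYOFF_A = jnp.array([[ 1., -1.],
                      [-1.,  1.]])
PAYOFF_B = -PAYOFF_A

def policy(logit):
    """Two-action policy from a single logit."""
    p = jax.nn.sigmoid(logit)
    return jnp.array([p, 1. - p])

def value(payoff, my_logit, opp_logit):
    """Expected payoff (rows = my action, columns = opponent's action)."""
    return policy(my_logit) @ payoff @ policy(opp_logit)

def naive_step(payoff, my_logit, opp_logit, lr=0.5):
    """Plain gradient ascent on my own payoff, opponent held fixed."""
    return my_logit + lr * jax.grad(value, argnums=1)(payoff, my_logit, opp_logit)

def shaping_step(my_payoff, opp_payoff, my_logit, opp_logit, lr=0.5, opp_lr=0.5):
    """Opponent-shaping update: ascend my payoff evaluated *after* the
    opponent's anticipated gradient step, differentiating through that
    step -- a minimal 'the other agent is learning too' model."""
    def shaped(my):
        opp_next = opp_logit + opp_lr * jax.grad(value, argnums=1)(
            opp_payoff, opp_logit, my)
        return value(my_payoff, my, opp_next)
    return my_logit + lr * jax.grad(shaped)(my_logit)

la = jnp.array(1.0)    # arbitrary asymmetric starting policies
lb = jnp.array(-1.0)
for _ in range(500):
    la, lb = (shaping_step(PAYOFF_A, PAYOFF_B, la, lb),
              shaping_step(PAYOFF_B, PAYOFF_A, lb, la))
print("P(heads):", float(jax.nn.sigmoid(la)), float(jax.nn.sigmoid(lb)))
```

Swapping `shaping_step` for `naive_step` gives the theory-of-mind-free baseline to compare dynamics against; the substance of the sketch is just that the gradient of `shaped` flows through the opponent’s own gradient step, so my update accounts for how I change their learning.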