Figure 1: placeholder.

Notes on how to implement alignment in AI systems. This is necessarily a fuzzy concept, because alignment is fuzzy and AI is fuzzy. We need to make peace with the frustrations of this fuzziness and move on.

1 Fine-tuning to do nice stuff

Think RLHF, Constitutional AI, etc. I’m not greatly persuaded that these are the right way to go, but they are interesting.

2 Classifying models as unaligned

I’m familiar only with mechanistic interpretability at the moment; I’m su… | The Dan MacKinlay stable of variably-well-consider’d enterprises
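For concreteness on the fine-tuning family: the reward-modelling step of RLHF typically fits a Bradley-Terry model to human preference pairs, so the reward model learns to score the preferred completion higher. A minimal sketch of that pairwise loss, assuming scalar reward scores (the function name is mine, not from any particular library):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood that the chosen response
    is preferred over the rejected one, given scalar reward scores."""
    # P(chosen preferred) = sigmoid(r_chosen - r_rejected)
    p_chosen = 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))
    return -math.log(p_chosen)

# Loss shrinks as the reward model ranks the chosen answer higher:
print(preference_loss(2.0, 0.0))  # small: model agrees with the human label
print(preference_loss(0.0, 2.0))  # large: model disagrees
```

In the full RLHF pipeline this loss trains the reward model; the policy is then tuned against that reward with something like PPO, which is where my scepticism about the approach mostly sits.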
Notes on AI Alignment Fast-Track - Losing control to AI

1 Session 1: What is AI alignment? – BlueDot Impact

- More Is Different for AI
- Paul Christiano, What failure looks like 👈 my favourite. Cannot believe I hadn’t read this.
- AI Could Defeat All Of Us Combined
- Why AI alignment could be hard with modern deep learning

Terminology I should have already known but didn’t: Convergent Instrumental Goals:

- Self-Preservation
- Goal Preservation
- Resource Acquisition
- Self-Improvement

Ajeya Cotra’s…
Figure 1: placeholder.

Notes on what kind of world models reside in neural nets.

1 Incoming

- NeurIPS 2023 Tutorial: Language Models meet World Models

2 References

- Basu, Grayson, Morrison, et al. 2024. “Understanding Information Storage and Transfer in Multi-Modal Large Language Models.”
- Chirimuuta. 2025. “The Prehistory of the Idea That Thinking Is Modelling.” Human Arenas.
- Ge, Huang, Zhou, et al. 2024. “WorldGPT: Empowering LLM as Multimodal World Model.” In Proceedings of the …