New paper walkthrough: Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small is a really exciting new mechanistic interpretability paper from Redwood Research. They reverse engineer a 26(!) head circuit in GPT-2 Small, used to solve Indirect Object Identification: the task of understanding that the sentence "After John and Mary went to the shops, John gave a bottle of milk to" should end in Mary, not John. | Neel Nanda
In a collaboration with Jess Smith, we read through the Anthropic paper Toy Models of Superposition and discuss, give intuitions and high-level takeaways. Watch it here and check out the original paper here. An explainer I wrote may be a helpful reference.
These are notes taken during a call with Itay Yona, an expert in software/hardware reverse engineering (SRE). Itay gave me an excellent distillation of key ideas and mindsets in the field, and we discussed analogies/disanalogies to mechanistic interpretability of neural networks. I'm ge...
New experiment: Recording myself real-time as I do mechanistic interpretability research! I try to answer the question of what happens if you train a toy transformer without positional embeddings on the task of "predict the previous token" - turns out that a two-layer model can rederive them! You can watch me do it here, and you can follow along with my code here. This uses a transformer mechanistic interpretability library I'm writing called EasyTransformer, and this was a good excuse to tes...
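To make the setup above concrete, here is a minimal PyTorch sketch (not the actual EasyTransformer code from the video; all names and hyperparameters are illustrative) of a two-layer transformer with no positional embeddings, with the loss wired so that the output at position i predicts the token at position i-1:

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters, not those from the video.
VOCAB, D_MODEL, SEQ_LEN = 50, 32, 16

class NoPosTransformer(nn.Module):
    """Two-layer transformer that deliberately omits positional embeddings."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, dim_feedforward=64, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.unembed = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        x = self.embed(tokens)  # note: no positional embedding is added
        # Causal mask: -inf strictly above the diagonal, 0 elsewhere.
        # The mask is the only source of positional asymmetry here.
        mask = torch.full((tokens.shape[1], tokens.shape[1]),
                          float("-inf")).triu(1)
        x = self.blocks(x, mask=mask)
        return self.unembed(x)

model = NoPosTransformer()
tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))
logits = model(tokens)  # shape (8, SEQ_LEN, VOCAB)

# "Predict the previous token": logits at position i are scored against
# the token at position i-1 (position 0 has no target).
loss = nn.functional.cross_entropy(
    logits[:, 1:].reshape(-1, VOCAB), tokens[:, :-1].reshape(-1))
```

Training against this loss is what makes the result interesting: with no positional embeddings, the causal attention mask is the only positional signal available, so a model that solves the task must rederive positional information internally.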
An intro guide to a mechanistic interpretability weekend hackathon.
I've recently been having advising calls with REMIX teams (Redwood's interpretability sprint), trying to give advice & feedback on projects. As an experiment, I've published a recording of one advising call (with Tessa Barton & Kushal Jain on memorisation in GPT-2 Small), and I'm curious whether this is useful to anyone! IMO getting detailed feedback from a more experienced researcher is one of the best ways to improve at research, but I have no idea whether someone else's feedback is comparatively us...
Co-authored by Neel Nanda and Jess Smith. Check out Concrete Steps for Getting Started in Mechanistic Interpretability for a better starting point. Why does this exist? People often get intimidated when trying to get into AI or AI Alignment research. People often think that the gulf betwee...
A write-up of work extending and building on the paper Emergent World Representations.
I'm excited about trying different formats for mechanistic interpretability education! I've made a video walkthrough where we replicate my paper, Progress Measures for Grokking via Mechanistic Interpretability. With Jess Smith, one of my co-authors, we recorded ourselves coding a replication and discussed what we did at each step. This is a three-part walkthrough, and you can see the accompanying code for the walkthrough here:
New paper walkthrough: In-Context Learning and Induction Heads. This is the second paper in Anthropic's Transformer Circuits thread, a series of papers trying to reverse engineer transformer language models. I read through it with Charles Frye (from Full Stack Deep Learning), and we discuss the paper and give takes and intuitions. See the original paper and a Twitter thread of my paper takeaways.
Introduction
Deprecated, see a much more up-to-date post here. Disclaimer: This post mostly links to resources I've made. I feel somewhat bad about this, sorry! Transformer MI is a pretty young and small field and there just aren't many people making educational resources tailored to it. So...
A highly opinionated list of what mechanistic interpretability papers to read when getting into the field.
A rough post exploring the emergent positional embedding hypothesis: rather than representing "this is the token in position 5", models may represent e.g. "this token is the second name in the sentence".
A stream-of-consciousness video walkthrough of A Mathematical Framework for Transformer Circuits.
A write-up of an incomplete project I worked on at Anthropic in early 2022, using gradient-based approximation to make activation patching far more scalable.