An intro guide to a mechanistic interpretability weekend hackathon | Neel Nanda
Co-authored by Neel Nanda and Jess Smith. Check out Concrete Steps for Getting Started in Mechanistic Interpretability for a better starting point. Why does this exist? People often get intimidated when trying to get into AI or AI Alignment research. People often think that the gulf between […] | Neel Nanda
A write-up of work extending and building on the paper Emergent World Representations | Neel Nanda
Introduction | Neel Nanda
Deprecated; see a much more up-to-date post here. Disclaimer: This post mostly links to resources I've made. I feel somewhat bad about this, sorry! Transformer MI is a pretty young and small field, and there just aren't many people making educational resources tailored to it. So […] | Neel Nanda
A rough post exploring the emergent positional embedding hypothesis: rather than representing "this is the token in position 5", models may represent, e.g., "this token is the second name in the sentence" | Neel Nanda
A write-up of an incomplete project I worked on at Anthropic in early 2022, using gradient-based approximation to make activation patching far more scalable | Neel Nanda
We’re experimenting with publishing more of our internal thoughts publicly. This piece may be less polished than our normal blog articles. Running AI Safety Fundamentals’ AI alignment and AI governance courses, we often have difficulty finding resources that hit our learning objectives well. Where we can find resources, often they’re not focused on what we want, or are hard for […] | BlueDot Impact