Week 6 of the AI alignment curriculum.

Interpretability is the study of ways to, well, interpret AI models — currently mainly neural networks.

Mechanistic interpretability

Mechanistic interpretability aims to understand networks at the level of individual neurons and the connections between them.

Zoom In: An Introduction to Circuits (Olah et al., 2020)

Claims:
- Features are the fundamental unit of neural networks. They correspond to directions in activation space.
- These features can be rigorously studied and understood.
- Features are connected by weights, forming circuits.
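The "features are directions" claim can be made concrete with a small sketch (my own illustration, not from the paper): if a feature corresponds to a direction in a layer's activation space, then how strongly that feature fires on an input is just the projection of the layer's activation vector onto that direction. The dimension, vectors, and direction here are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8  # width of the layer (illustrative)
activations = rng.normal(size=d_model)  # one input's activation vector at this layer

# Hypothetical feature direction, normalized so the projection is a
# scale-free "strength" of the feature on this input.
feature_direction = rng.normal(size=d_model)
feature_direction /= np.linalg.norm(feature_direction)

# Feature activation = dot product of activations with the feature direction.
feature_activation = activations @ feature_direction
print(f"feature activation: {feature_activation:.3f}")
```

In the simplest case a feature direction is a single neuron's basis vector; the circuits work argues that meaningful features are often such neuron-aligned directions, though they need not be.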