Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze. | transformer-circuits.pub
Eliciting latent knowledge: How to tell if your eyes deceive you. Paul Christiano, Ajeya Cotra, and Mark Xu, Alignment Research Center, December 2021. In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: Suppose we ... | Google Docs
Thanks to Chris Olah, Neel Nanda, Kate Woolverton, Richard Ngo, Buck Shlegeris, Daniel Kokotajlo, Kyle McDonell, Laria Reynolds, Eliezer Yudkowsky, M… | www.alignmentforum.org
TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that migh… | www.alignmentforum.org