ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understand… | www.alignmentforum.org
We prototype using mechanistic interpretability to derive and formally verify guarantees on model performance in a toy setting. | www.alignmentforum.org
An informal description of ARC’s current research approach, follow-up to Eliciting Latent Knowledge | Alignment Research Center