We prototype using mechanistic interpretability to derive and formally verify guarantees on model performance in a toy setting. | www.alignmentforum.org
An informal description of ARC’s current research approach, follow-up to Eliciting Latent Knowledge | Alignment Research Center