Compact Proofs of Model Performance via Mechanistic Interpretability — AI Alignment Forum
https://www.alignmentforum.org/posts/bRsKimQcPTX3tNNJZ/compact-proofs-of-model-performance-via-mechanistic
We prototype using mechanistic interpretability to derive and formally verify guarantees on model performance in a toy setting.