This post is an elaboration on “tractability of discrimination” as introduced in section III of “Can we efficiently explain model behaviors?”. For an overview of the general plan this fits into, see “Mechanistic anomaly detection” and “Finding gliders in the game of life”.
Finding explanations is a relatively unambitious interpretability goal. If it is intractable, then that’s an important obstacle to interpretability in general. If we formally define “explanations,” then finding them is a well-posed search problem, and there is a plausible argument for tractability.