Former ARC researcher David Matolcsi has put together a sequence of posts that explores ARC's big-picture vision for our research and examines several obstacles that we face. We think these posts will be useful to readers who are interested in digging into the details of our big-picture vision and understanding…
In a recent paper in Annals of Mathematics and Philosophy, Fields medalist Timothy Gowers asks why mathematicians sometimes believe that unproved statements are likely to be true. For example, it is unknown whether \(\pi\) is a normal number (which, roughly speaking, means that every digit appears in \(\pi\) with equal frequency)…
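For reference (this is the standard definition, not quoted from the post): a real number is normal if, for every base \(b \ge 2\) and every digit string \(w\) of length \(k\), the string \(w\) appears in the base-\(b\) expansion \(d_1 d_2 d_3 \ldots\) with limiting frequency \(b^{-k}\):
\[
\lim_{n \to \infty} \frac{\#\{\, i \le n : d_i d_{i+1} \cdots d_{i+k-1} = w \,\}}{n} = b^{-k}.
\]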
Over the last few months, ARC has released a number of pieces of research. While some of these can be independently motivated, there is also a more unified research vision behind them. The purpose of this post is to try to convey some of that vision and how our individual…
ARC recently released our first empirical paper: Estimating the Probabilities of Rare Language Model Outputs. In this work, we construct a simple setting for low probability estimation — single-token argmax sampling in transformers — and use it to compare the performance of various estimation methods. ARC views low probability estimation…
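As a rough illustration of the problem being estimated (not of the methods compared in the paper), here is a minimal sketch of the naive Monte Carlo baseline, with a hypothetical sample_input distribution and model_argmax function:

import random

def naive_rare_probability(sample_input, model_argmax, target_token, n_samples=100_000):
    # Naive Monte Carlo estimate of P[argmax output == target_token] over a
    # distribution of inputs. For an output with true probability p this needs
    # on the order of 1/p samples, which is the baseline that dedicated
    # estimation methods aim to beat.
    hits = 0
    for _ in range(n_samples):
        x = sample_input()  # draw a random input
        hits += model_argmax(x) == target_token
    return hits / n_samples

# Hypothetical toy usage: a "model" whose argmax is token 1 only on extreme inputs.
estimate = naive_rare_probability(
    sample_input=lambda: random.random(),
    model_argmax=lambda x: 1 if x > 0.999 else 0,
    target_token=1,
)
print(estimate)  # roughly 0.001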
Last week, ARC released a paper called Towards a Law of Iterated Expectations for Heuristic Estimators, which follows up on previous work on formalizing the presumption of independence. Most of the work described here was done in 2023. A brief table of contents for this post: What is a heuristic estimator?…
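For context, the classical law of iterated expectations, of which the paper seeks an analogue for heuristic estimators, says that conditioning on extra information and then averaging it out recovers the original expectation:
\[
\mathbb{E}[Y] = \mathbb{E}\big[\mathbb{E}[Y \mid X]\big].
\]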
Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while an artificial intelligence (AI) assistant may be generally safe, it would be catastrophic if it ever suggested…
ARC has released a paper on Backdoor defense, learnability and obfuscation in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between backdoors and deceptive alignment, the possibility that an AI system would intentionally behave well in training in…
ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural network, we would hope to be able to use that understanding to verify that the network was not going…
Earlier this year ARC posted a prize for two matrix completion problems. We received a number of submissions we considered useful, but no complete solutions. We are closing the contest and awarding the following partial prizes:
* $500 to Elad Hazan for solving a related problem and pointing us to…
The Alignment Research Center’s Theory team is starting a new hiring round for researchers with a theoretical background. Please apply here. Update January 2024: we have paused hiring and expect to reopen in the second half of 2024. We are open to expressions of interest but do not plan…
Here are two self-contained algorithmic questions that have come up in our research. We're offering a bounty of $5k for a solution to either of them—either an algorithm, or a lower bound under any hardness assumption that has appeared in the literature. Question 1 (existence of PSD completions): given…
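To illustrate the object in question (this is not the bounty problem itself; it uses a hypothetical 3×3 example and assumes the cvxpy library), one can check whether a partially specified symmetric matrix admits a positive semidefinite completion with a generic semidefinite feasibility program:

import cvxpy as cp

# Hypothetical partial matrix: diagonal fixed to 1, two off-diagonal entries known.
n = 3
known = {(0, 0): 1.0, (1, 1): 1.0, (2, 2): 1.0, (0, 1): 0.9, (1, 2): 0.9}

X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0]  # X must be positive semidefinite
constraints += [X[i, j] == v for (i, j), v in known.items()]

# Feasibility problem: any PSD matrix matching the known entries will do.
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print("PSD completion exists:", prob.status == cp.OPTIMAL)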
This post is an elaboration on “tractability of discrimination” as introduced in section III of “Can we efficiently explain model behaviors?” For an overview of the general plan this fits into, see “Mechanistic anomaly detection” and “Finding gliders in the game of life”.
Finding explanations is a relatively unambitious interpretability goal. If it is intractable, then that’s an important obstacle to interpretability in general. If we formally define “explanations,” then finding them is a well-posed search problem and there is a plausible argument for tractability.
ARC’s current approach to ELK is to point to latent structure within a model by searching for the “reason” for particular correlations in the model’s output. In this post we’ll walk through a very simple example of using this approach to identify gliders in the game of life.
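For readers unfamiliar with the setting, here is a minimal sketch (not part of the post) of the Game of Life update rule and the standard glider, the pattern the example aims to identify:

import numpy as np

def life_step(grid):
    # One step of Conway's Game of Life on a toroidal grid: count live
    # neighbors by summing the eight shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next step if it has exactly 3 live neighbors,
    # or if it is currently alive and has exactly 2.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

# The standard glider: after four steps it reappears shifted one cell diagonally.
grid = np.zeros((8, 8), dtype=int)
for r, c in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:
    grid[r, c] = 1

for _ in range(4):
    grid = life_step(grid)
print(grid)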
An informal description of ARC’s current research approach, a follow-up to Eliciting Latent Knowledge.
Roughly speaking, the goal of ELK is to incentivize ML models to honestly answer “straightforward” questions where the right answer is unambiguous and known by the model. We are offering prizes of $5,000 to $50,000 for proposed strategies for ELK.
This is an archived version of the early 2022 hiring round page.