Published on August 27, 2025 5:04 PM GMT Here's a relatively important question regarding transparency requirements for AI companies: At which points in time should AI companies be required to disclose information? (While I focus on transparency, this question is also applicable to other safety-relevant requirements, and is applicable to norms around voluntary actions rather than requirements.) A natural option would be to attach transparency requirements to the existing processes of pre-depl...| AI Alignment Forum
Published on August 27, 2025 1:00 PM GMT There are two ways to show that an AI system is safe: show that it doesn't have dangerous capabilities, or show that it's safe even if it has dangerous capabilities. Until three months ago, AI companies said their models didn't have dangerous capabilities. (At the time, I wrote that the companies' eval reports didn't support their claims that their models lacked dangerous bio capabilities.) Now, Anthropic, OpenAI, Google DeepMind, and xAI say their ...| AI Alignment Forum
Published on August 26, 2025 5:07 PM GMT Let’s start with the classic Maxwell’s Demon setup. We have a container of gas, i.e. a bunch of molecules bouncing around. Down the middle of the container is a wall with a tiny door in it, which can be opened or closed by a little demon who likes to mess with thermodynamics researchers. Maxwell[1] imagined that the little demon could, in principle, open the door whenever a molecule flew toward it from the left, and close the door whenever a mol...| AI Alignment Forum
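A minimal sketch of the setup described above (my own toy illustration, not taken from the post): point molecules bounce around a 1D box with a partition at the midpoint, and an idealized demon opens the door only for left-to-right crossings, so the gas gradually piles up on the right. Particle count, step count, and time step are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

N, STEPS, DT = 1000, 20000, 1e-3
WALL = 0.5  # position of the partition with the demon's door

# Initialize positions uniformly in [0, 1] and velocities from a Gaussian.
x = rng.uniform(0.0, 1.0, N)
v = rng.normal(0.0, 1.0, N)

for _ in range(STEPS):
    x_new = x + v * DT

    # Reflect off the outer walls at 0 and 1.
    hit_left = x_new < 0.0
    hit_right = x_new > 1.0
    x_new[hit_left] = -x_new[hit_left]
    x_new[hit_right] = 2.0 - x_new[hit_right]
    v[hit_left | hit_right] *= -1.0

    # The demon opens the door only for left-to-right crossings;
    # molecules trying to cross right-to-left bounce off the closed door.
    crossing_rl = (x >= WALL) & (x_new < WALL)
    x_new[crossing_rl] = 2.0 * WALL - x_new[crossing_rl]
    v[crossing_rl] *= -1.0

    x = x_new

print(f"fraction of molecules on the right after {STEPS} steps: {(x >= WALL).mean():.2f}")
```

Running this long enough drives the right-hand fraction toward 1.0, which is the apparent entropy decrease the demon is supposed to buy for free.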
Published on August 26, 2025 12:18 AM GMT This is a linkpost for https://www.arxiv.org/pdf/2508.16245 With Marcus Hutter, Jan Leike (@janleike), and Jessica Taylor (@jessicata), I have revisited Leike et al.'s paper "A Formal Solution to the Grain of Truth Problem" (AFSGOTP) which studies games between reflective AIXI agents and... further formalized it. The result is "Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games" (LCGOTACEFUG) which perhaps coul...| AI Alignment Forum
Published on August 24, 2025 4:19 AM GMT These are some research notes on whether we could reduce AI takeover risk by cooperating with unaligned AIs. I think the best and most readable public writing on this topic is “Making deals with early schemers”, so if you haven't read that post, I recommend starting there. These notes were drafted before that post existed, and the content is significantly overlapping. Nevertheless, these notes do contain several points that aren’t in that post, a...| AI Alignment Forum
Published on August 22, 2025 9:46 PM GMT Our posts on natural latents have involved two distinct definitions, which we call "stochastic" and "deterministic" natural latents. We conjectured that, whenever there exists a stochastic natural latent (to within some approximation), there also exists a deterministic natural latent (to within a comparable approximation). Four months ago, we put up a bounty to prove this conjecture. We've been bottlenecked pretty hard on this problem, and spent most of...| AI Alignment Forum
Published on August 22, 2025 1:03 PM GMT Introduction Credal sets, a special case of infradistributions[1] in infra-Bayesianism and classical objects in imprecise probability theory, provide a means of describing uncertainty without assigning exact probabilities to events as in Bayesianism. This is significant because, as argued in the introduction to this sequence, Bayesianism is inadequate as a framework for AI alignment research. We will focus on credal sets rather than general infradistri...| AI Alignment Forum
Published on August 21, 2025 11:11 PM GMT This post accompanies An Introduction to Credal Sets and Infra-Bayes Learnability. Notation We use ΔX to denote the space of probability distributions over a set X, which is assumed throughout to be a compact metric space. We use □X to denote the set of credal sets over X. Given f:X→R and m∈ΔX, let m(f) := E_m[f]. Let C(X,Y) denote the space of continuous functions from X to Y. Proof of Lemma 1 Lemma 1: If A and O are finit...| AI Alignment Forum
Published on August 21, 2025 10:43 PM GMT Suppose random variables X1 and X2 contain approximately the same information about a third random variable Λ, i.e. both of the following diagrams are satisfied to within approximation ϵ (figure caption: "Red" for redundancy). We call Λ a "redund" over X1,X2, since conceptually, any information Λ contains about X must be redundantly represented in both X1 and X2 (to within approximation). Here's an intuitive claim which is surprisingly tricky to pro...| AI Alignment Forum
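For readers who want the two diagrams spelled out: one common way to write the approximate conditions (my assumption about the intended formalization, with diagram error measured in KL divergence as in the natural latents posts) is

$$D_{\mathrm{KL}}\big(P(\Lambda,X_1,X_2)\,\big\|\,P(X_1,X_2)\,P(\Lambda\mid X_1)\big)\le\epsilon,
\qquad
D_{\mathrm{KL}}\big(P(\Lambda,X_1,X_2)\,\big\|\,P(X_1,X_2)\,P(\Lambda\mid X_2)\big)\le\epsilon,$$

i.e. Λ is approximately computable, in the informational sense, from X1 alone and from X2 alone.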
There’s a funny thing where economics education paradoxically makes people DUMBER at thinking about future AI. Econ textbooks teach concepts & frames that are great for most things, but counterproductive for thinking about AGI. Here are 4 examples…| www.alignmentforum.org
I think Debate is probably the most exciting existing safety research direction. This is a pretty significant shift from my opinions when I first rea…| www.alignmentforum.org
Comment by Rohin Shah - My snapshot: https://elicit.ought.org/builder/xPoVZh7Xq Idk what we mean by "AGI", so I'm predicting when transformative AI will be developed instead. This is still a pretty fuzzy target: at what point do we say it's "transformative"? Does it have to be fully deployed and we already see the huge economic impact? Or is it just the point at which the model training is complete? I'm erring more on the side of "when the model training is complete", but also there may be lo...| www.alignmentforum.org
Interpretability provides access to AI systems' internal mechanisms, offering a window into how models process information and make decisions.| www.alignmentforum.org
What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper dem…| www.alignmentforum.org
Highly compressed technical summary [NOT intended to be widely legible]: A preliminary investigation of action evidential decision theory with respec…| www.alignmentforum.org
Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees.…| www.alignmentforum.org
In Corrigibility Can Be VNM-Incoherent, I operationalized an agent's corrigibility as our ability to modify the agent so that it follows different po…| www.alignmentforum.org
Intuitions can only get us so far in understanding corrigibility. Let’s dive into some actual math!| www.alignmentforum.org
Human values are functions of latent variables in our minds. But those variables may not correspond to anything in the real world. How can an AI opti…| www.alignmentforum.org
Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much fas…| www.alignmentforum.org
AI researchers warn that advanced machine learning systems may develop their own internal goals that don't match what we intended. This "mesa-optimiz…| www.alignmentforum.org
GDP isn't a great metric for AI timelines or takeoff speed because the relevant events (like AI alignment failure or progress towards self-improving…| www.alignmentforum.org
Summary * In August 2020, we conducted an online survey of prominent AI safety and governance researchers. You can see a copy of the survey at this…| www.alignmentforum.org
A few months after writing this post I realized that one of the key arguments was importantly flawed. I therefore recommend against inclusion in the…| www.alignmentforum.org
Here are two problems you’ll face if you’re an AI company building and using powerful AI: …| www.alignmentforum.org
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduc…| www.alignmentforum.org
Abstract In this paper, LLMs are tasked with completing an impossible quiz, while they are in a sandbox, monitored, told about these measures and ins…| www.alignmentforum.org
gpt2-small's head L1H5 directs attention to semantically similar tokens and actively suppresses self-attention. The head computes attention…| www.alignmentforum.org
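Not the post's methodology, but a quick way to eyeball this head's pattern yourself using the Hugging Face transformers library (assuming the standard gpt2 checkpoint, and indexing layer 1, head 5 zero-based): low values on the diagonal would be consistent with the claimed self-attention suppression.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

text = "The cat sat on the mat because the cat was tired."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, n_heads, seq_len, seq_len).
attn_l1h5 = outputs.attentions[1][0, 5]  # layer 1, head 5

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
diag = attn_l1h5.diagonal()  # attention each token pays to itself
for tok, self_attn in zip(tokens, diag):
    print(f"{tok!r:>12}  self-attention: {self_attn.item():.3f}")
```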
Daniel Kokotajlo presents his best attempt at a concrete, detailed guess of what 2022 through 2026 will look like, as an exercise in forecasting. It…| www.alignmentforum.org
ARC’s current approach to ELK is to point to latent structure within a model by searching for the “reason” for particular correlations in the model’s…| www.alignmentforum.org
(Follow-up to Eliciting Latent Knowledge. Describing joint work with Mark Xu. This is an informal description of ARC’s current research approach; not…| www.alignmentforum.org
Highlights * We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agenti…| www.alignmentforum.org
This is the third of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machi…| www.alignmentforum.org
This is the fourth of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Mach…| www.alignmentforum.org
Inner alignment and objective robustness have been frequently discussed in the alignment community since the publication of “Risks from Learned Optim…| www.alignmentforum.org
Currently, we do not have a good theoretical understanding of how or why neural networks actually work. For example, we know that large neural networ…| www.alignmentforum.org
This is a quick list of interventions that might help fix issues from reward hacking. …| www.alignmentforum.org
Paper authors: Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie M…| www.alignmentforum.org
This is a list of projects[1] to consider for folks who want to use Wise AI to steer the world towards positive outcomes. …| www.alignmentforum.org
When using a fuzzy, intuitive approach, it’s easy to gloss over issues by imagining that a corrigible AGI will behave like a helpful human servant. By using a sharper, more mathematical frame, we can more precisely investigate where corrigibility may have problems, such as by testing whether a purely corrigible agent behaves nicely in toy settings.| www.alignmentforum.org
Produced As Part Of The SERI ML Alignment Theory Scholars Program 2022 Under John Wentworth …| www.alignmentforum.org
This is the second of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Mach…| www.alignmentforum.org
An agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.| www.alignmentforum.org
Imagine the space of goals as a two-dimensional map, where simple goals take up more area. Some goals are unstable, in that an agent with that goal will naturally change into having a different goal. We can arrange goal-space...| www.alignmentforum.org
Comment by Sammy Martin - Some points that didn't fit into the main post: The full survey results break down as 16% 'Superintelligence' (i.e. some version of 'brain-in-a-box'), 16% WFLL 2 and 18% WFLL 1, for a total of 49% of the probability mass explicitly covered by our report (Note that these are all means of distributions over different probabilities. Adding the overall distributions and then taking the mean gives a probability of 49%, different from directly adding the means of each d...| www.alignmentforum.org
This is the second post in a sequence mapping out the AI Alignment research landscape. The sequence will likely never be completed, but you can read…| www.alignmentforum.org
Thanks to Chris Olah, Neel Nanda, Kate Woolverton, Richard Ngo, Buck Shlegeris, Daniel Kokotajlo, Kyle McDonell, Laria Reynolds, Eliezer Yudkowsky, M…| www.alignmentforum.org
Double descent is a puzzling phenomenon in machine learning where increasing model size/training time/data can initially hurt performance, but then i…| www.alignmentforum.org
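A toy illustration of the phenomenon in the random-features / minimum-norm least-squares setting often used to demonstrate it (my own sketch, not the post's experiment): test error typically spikes near the interpolation threshold, where the number of features roughly equals the number of training points, and then falls again as the model keeps growing. Exact numbers vary with the random seed; the qualitative peak-then-descent shape is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

n_train, n_test, noise = 40, 500, 0.2
x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + noise * rng.normal(size=n_train)
x_test = rng.uniform(-1, 1, n_test)
y_test = target(x_test)

def relu_features(x, W, b):
    # Random ReLU features: phi_j(x) = max(0, w_j * x + b_j).
    return np.maximum(0.0, np.outer(x, W) + b)

for n_features in [5, 10, 20, 40, 80, 160, 640]:
    W = rng.normal(size=n_features)
    b = rng.normal(size=n_features)
    Phi_train = relu_features(x_train, W, b)
    Phi_test = relu_features(x_test, W, b)
    # lstsq returns the minimum-norm least-squares solution, which becomes
    # the interpolating solution once n_features exceeds n_train.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"{n_features:>4} features: test MSE = {test_mse:.3f}")
```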
ARC explores the challenge of extracting information from AI systems that isn't directly observable in their outputs, i.e. "eliciting latent knowledge…| www.alignmentforum.org
This is a response to ARC's first technical report: Eliciting Latent Knowledge. But it should be fairly understandable even if you didn't read ARC's…| www.alignmentforum.org
I wrote this post to get people’s takes on a type of work that seems exciting to me personally; I’m not speaking for Open Phil as a whole. Institutio…| www.alignmentforum.org
Thanks to Roger Grosse, Cem Anil, Sam Bowman, Tamera Lanham, and Mrinank Sharma for helpful discussion and comments on drafts of this post. …| www.alignmentforum.org
TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that migh…| www.alignmentforum.org
This is the long-form version of a public comment on Anthropic's Towards Monosemanticity paper …| www.alignmentforum.org
I’ll attempt to build up details around what I mean by “corrigibility” through small stories about a purely corrigible agent whom I’ll call Cora, and her principal, whom I’ll name Prince.| www.alignmentforum.org
This post represents my personal hot takes, not the opinions of my team or employer. This is a massively updated version of a similar list I made two…| www.alignmentforum.org
This is a research update on some work that I’ve been doing on Scalable Oversight at Anthropic, based on the original AI safety via debate proposal a…| www.alignmentforum.org
ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understand…| www.alignmentforum.org
We prototype using mechanistic interpretability to derive and formally verify guarantees on model performance in a toy setting.| www.alignmentforum.org
In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they appl…| www.alignmentforum.org
This is an update on the work on AI Safety via Debate that we previously wrote about here. …| www.alignmentforum.org
Tl;dr We want to be able to supervise models with superhuman knowledge of the world and how to manipulate it. For this we need an overseer to be able…| www.alignmentforum.org
This is a sequence version of the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper. Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such...| www.alignmentforum.org
Paul Christiano paints a vivid and disturbing picture of how AI could go wrong, not with sudden violent takeover, but through a gradual loss of human…| www.alignmentforum.org
The following is an edited transcript of a talk I gave. I have given this talk at multiple places, including first at Anthropic and then for ELK winn…| www.alignmentforum.org
The original draft of Ajeya's report on biological anchors for AI timelines. The report includes quantitative models and forecasts, though the specif…| www.alignmentforum.org
We are no longer accepting submissions. We'll get in touch with winners and make a post about winning proposals sometime in the next month. …| www.alignmentforum.org
Behavior cloning (BC) is, put simply, when you have a bunch of human expert demonstrations and you train your policy to maximize likelihood over the…| www.alignmentforum.org
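A minimal sketch of that training objective, assuming discrete actions and an in-memory batch of expert (observation, action) pairs; all shapes and names below are made-up placeholders, not from the post.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 8-dimensional observations, 4 discrete actions.
OBS_DIM, N_ACTIONS = 8, 4

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),  # logits over actions
)

# Stand-in for a dataset of expert demonstrations: (observation, action) pairs.
expert_obs = torch.randn(1024, OBS_DIM)
expert_actions = torch.randint(0, N_ACTIONS, (1024,))

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # negative log-likelihood of the expert action

for epoch in range(20):
    logits = policy(expert_obs)
    # Maximizing likelihood of the expert actions == minimizing cross-entropy.
    loss = loss_fn(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final BC loss:", loss.item())
```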