Rafael Harth's profile on the AI Alignment Forum — A community blog devoted to technical AI alignment research| www.alignmentforum.org
Rohin Shah's profile on the AI Alignment Forum| www.alignmentforum.org
John Schulman's profile on the AI Alignment Forum| www.alignmentforum.org
danieldewey's profile on the AI Alignment Forum| www.alignmentforum.org
Beth Barnes's profile on the AI Alignment Forum| www.alignmentforum.org
Akbir Khan's profile on the AI Alignment Forum| www.alignmentforum.org
Note: If you’ll forgive the shameless self-promotion, applications for my MATS stream are open until Sept 12. I help people write a mech interp paper…| www.alignmentforum.org
There’s a funny thing where economics education paradoxically makes people DUMBER at thinking about future AI. Econ textbooks teach concepts & frames that are great for most things, but counterproductive for thinking about AGI. Here are 4 examples…| www.alignmentforum.org
Interpretability provides access to AI systems' internal mechanisms, offering a window into how models process information and make decisions.| www.alignmentforum.org
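As a concrete illustration of what "access to internal mechanisms" can mean in practice, here is a minimal sketch, not taken from the linked post: a toy PyTorch model whose hidden-layer activations are recorded with a forward hook, so we can inspect an intermediate representation rather than only the final output. The model, layer sizes, and variable names are invented for the example.

    # Illustrative sketch only: read a hidden layer's activations with a forward hook.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    captured = {}

    def save_activation(module, inputs, output):
        captured["hidden"] = output.detach()  # the model's internal representation

    handle = model[1].register_forward_hook(save_activation)  # hook the ReLU layer
    logits = model(torch.randn(4, 16))
    print(captured["hidden"].shape)  # torch.Size([4, 32]) -- internals, not just logits
    handle.remove()

Mechanistic interpretability work builds on this kind of access, analyzing what such intermediate activations represent.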
What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper dem…| www.alignmentforum.org
Highly compressed technical summary [NOT intended to be widely legible]: A preliminary investigation of action evidential decision theory with respec…| www.alignmentforum.org
Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees.…| www.alignmentforum.org
In Corrigibility Can Be VNM-Incoherent, I operationalized an agent's corrigibility as our ability to modify the agent so that it follows different po…| www.alignmentforum.org
Intuitions can only get us so far in understanding corrigibility. Let’s dive into some actual math!| www.alignmentforum.org
Human values are functions of latent variables in our minds. But those variables may not correspond to anything in the real world. How can an AI opti…| www.alignmentforum.org
Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much fas…| www.alignmentforum.org
AI researchers warn that advanced machine learning systems may develop their own internal goals that don't match what we intended. This "mesa-optimiz…| www.alignmentforum.org
GDP isn't a great metric for AI timelines or takeoff speed because the relevant events (like AI alignment failure or progress towards self-improving…| www.alignmentforum.org
Summary: In August 2020, we conducted an online survey of prominent AI safety and governance researchers. You can see a copy of the survey at this…| www.alignmentforum.org
A few months after writing this post I realized that one of the key arguments was importantly flawed. I therefore recommend against inclusion in the…| www.alignmentforum.org
Here are two problems you’ll face if you’re an AI company building and using powerful AI: …| www.alignmentforum.org
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduc…| www.alignmentforum.org
Abstract: In this paper, LLMs are tasked with completing an impossible quiz, while they are in a sandbox, monitored, told about these measures and ins…| www.alignmentforum.org
When using a fuzzy, intuitive approach, it’s easy to gloss over issues by imagining that a corrigible AGI will behave like a helpful, human servant. By using a sharper, more mathematical frame, we can more precisely investigate where corrigibility may have problems, such as by testing whether a purely corrigible agent behaves nicely in toy settings.| www.alignmentforum.org
Produced as part of the SERI ML Alignment Theory Scholars Program 2022 under John Wentworth …| www.alignmentforum.org
This is the second of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Mach…| www.alignmentforum.org
An agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.| www.alignmentforum.org
Imagine the space of goals as a two-dimensional map, where simple goals take up more area. Some goals are unstable, in that an agent with that goal will naturally change into having a different goal. We can arrange goal-space...| www.alignmentforum.org
Comment by Sammy Martin - Some points that didn't fit into the main post: The full survey results break down as 16% 'Superintelligence' (i.e. some version of 'brain-in-a-box'), 16% WFLL 2, and 18% WFLL 1, for a total of 49% of the probability mass explicitly covered by our report (Note that these are all means of distributions over different probabilities. Adding the overall distributions and then taking the mean gives a probability of 49%, different from directly adding the means of each d...| www.alignmentforum.org
This is the second post in a sequence mapping out the AI Alignment research landscape. The sequence will likely never be completed, but you can read…| www.alignmentforum.org
Thanks to Chris Olah, Neel Nanda, Kate Woolverton, Richard Ngo, Buck Shlegeris, Daniel Kokotajlo, Kyle McDonell, Laria Reynolds, Eliezer Yudkowsky, M…| www.alignmentforum.org
ARC explores the challenge of extracting information from AI systems that isn't directly observable in their outputs, i.e., "eliciting latent knowledge…| www.alignmentforum.org
This is a response to ARC's first technical report: Eliciting Latent Knowledge. But it should be fairly understandable even if you didn't read ARC's…| www.alignmentforum.org
I wrote this post to get people’s takes on a type of work that seems exciting to me personally; I’m not speaking for Open Phil as a whole. Institutio…| www.alignmentforum.org
Thanks to Roger Grosse, Cem Anil, Sam Bowman, Tamera Lanham, and Mrinank Sharma for helpful discussion and comments on drafts of this post. …| www.alignmentforum.org
TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that migh…| www.alignmentforum.org
This is the long-form version of a public comment on Anthropic's Towards Monosemanticity paper …| www.alignmentforum.org
I’ll attempt to build up details around what I mean by “corrigibility” through small stories about a purely corrigible agent whom I’ll call Cora, and her principal, whom I’ll name Prince.| www.alignmentforum.org
ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understand…| www.alignmentforum.org
We prototype using mechanistic interpretability to derive and formally verify guarantees on model performance in a toy setting.| www.alignmentforum.org
This is an update on the work on AI Safety via Debate that we previously wrote about here. …| www.alignmentforum.org
TL;DR: We want to be able to supervise models with superhuman knowledge of the world and how to manipulate it. For this we need an overseer to be able…| www.alignmentforum.org
Paul Christiano paints a vivid and disturbing picture of how AI could go wrong, not with sudden violent takeover, but through a gradual loss of human…| www.alignmentforum.org
The original draft of Ajeya's report on biological anchors for AI timelines. The report includes quantitative models and forecasts, though the specif…| www.alignmentforum.org
We are no longer accepting submissions. We'll get in touch with winners and make a post about winning proposals sometime in the next month. …| www.alignmentforum.org
Behavior cloning (BC) is, put simply, when you have a bunch of human expert demonstrations and you train your policy to maximize likelihood over the…| www.alignmentforum.org
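As a rough sketch of the recipe that last snippet describes (assuming discrete actions and a toy PyTorch policy; the dataset tensors below are random stand-ins rather than real demonstrations, and none of the names come from the linked post), behavior cloning reduces to supervised learning on expert state-action pairs:

    # Behavior cloning sketch (illustrative; assumes discrete actions).
    import torch
    import torch.nn as nn

    state_dim, n_actions = 8, 4  # toy sizes

    policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    # Stand-ins for logged expert (state, action) demonstrations.
    expert_states = torch.randn(1024, state_dim)
    expert_actions = torch.randint(0, n_actions, (1024,))

    for step in range(500):
        logits = policy(expert_states)
        # Maximizing likelihood of the expert's actions == minimizing cross-entropy.
        loss = nn.functional.cross_entropy(logits, expert_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Cross-entropy on the expert's actions is exactly the negative log-likelihood objective the snippet refers to.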