Suppose that I try to define a human’s decision process by observing some decisions and conditioning the universal prior on agreement with those decisions (see here). I have argued that the b…
I have described a candidate scheme for mathematically pinpointing the human decision process, by conditioning the universal prior on agreement with the human’s observed behavior. I would like…
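As a toy illustration of the conditioning step (not the actual proposal, which uses the uncomputable universal prior over all programs), here is a sketch in which a tiny hand-picked hypothesis class stands in for the prior; all names and description lengths are illustrative:

```python
# Toy illustration: condition a length-weighted prior over candidate
# decision processes on exact agreement with observed decisions. The real
# scheme mixes over all programs with weight 2^-length; here a small
# hand-picked hypothesis class stands in for it.

observed = [(1, 2), (2, 4), (3, 6)]  # (situation, decision) pairs

# Hypotheses: (name, decision process, description length in bits).
hypotheses = [
    ("double", lambda x: 2 * x, 5),
    ("square", lambda x: x * x, 6),
    ("add_one", lambda x: x + 1, 6),
]

# Posterior: keep only hypotheses that agree with every observation,
# then renormalize the surviving prior mass.
posterior = {}
for name, f, length in hypotheses:
    if all(f(x) == y for x, y in observed):
        posterior[name] = 2.0 ** -length
total = sum(posterior.values())
posterior = {name: w / total for name, w in posterior.items()}

print(posterior)  # {'double': 1.0} -- the other hypotheses disagree with the data
```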
Although I don’t yet have any idea how to build an AI which pursues a goal I give it, I am optimistic that one day humans might. Writing down any understandable goal at all, much less one whi…
An allegory: Consider a human controlling a very fast car on a busy street using counterfactual oversight. The car is perfectly capable of driving safely. But what happens in the 1% of cases where the car decides to pause and ask the human to review its proposed behavior, or to suggest an action? Without some […]
Suppose that I’ve taken lots of videos of people performing activities, paid Bob to label each one with a description of the activity, and then trained a classifier X using that training data. Maybe some of those videos include behaviors like “labelling training data for the classifier X,” “collecting training data for the classifier X,” etc. […]
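Concretely, the setup is ordinary supervised learning. A minimal sketch, with feature extraction elided, random data standing in for the videos and for Bob’s labels, and all names illustrative:

```python
# Minimal sketch of the setup above: Bob's labels plus featurized videos
# train a classifier X. Feature extraction is elided; assume each video
# is already a fixed-length feature vector.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
video_features = rng.normal(size=(200, 32))   # stand-in for featurized videos
bob_labels = rng.integers(0, 3, size=200)     # Bob's activity labels (3 classes)

X = LogisticRegression(max_iter=1000).fit(video_features, bob_labels)

# The worry raised in the post: some training videos may *depict* the very
# process of labelling or collecting data for X, so X's training data
# refers to X itself.
new_video = rng.normal(size=(1, 32))
print(X.predict(new_video))
```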
For supervised learning systems, serious concerns about “extortion” are essentially equivalent to concerns about simulations. These problems can’t really be resolved by better decision theory; they can only be resolved by pursuing unsupervised objectives or addressing concerns about simulations. Supervised learners: As examples, consider either a reinforcement learning system, trained to maximize the discounted sum […]
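The excerpt cuts off mid-definition; presumably the objective in view is the standard discounted return, which for rewards $r_t$ and a discount factor $\gamma \in (0, 1)$ is

$$R = \sum_{t=0}^{\infty} \gamma^t r_t.$$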
I’ve recently spent some more time thinking about speculative issues in AI safety: Ideas for building useful agents without goals: approval-directed agents, approval-directed bootstrapping, and optimization and goals. I think this line of reasoning is very promising. A formalization of one piece of the AI safety challenge: the steering problem. I am eager to see more precise, high-level discussion […]
My current attitude towards the Löbian obstacle is “just live with it.” This post outlines that view and some of the underlying intuitions. To show I’m being a good sport, I’…
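For context, the obstacle stems from Löb’s theorem, which holds in any sufficiently strong theory with provability predicate $\Box$:

$$\vdash \Box(\Box P \rightarrow P) \rightarrow \Box P$$

so a theory that asserts full trust in its own proofs ($\Box P \rightarrow P$ for every $P$) proves every $P$, and an agent reasoning in a fixed theory cannot fully trust successors that reason in that same theory.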
My current preferred formalization of extrapolation of an agent’s preferences rests on imagining what would happen if that agent were provided with an idealized environment in which it could u…
Suppose that I have in hand a perfect model of my decision-making process, and I am interested in using this to define what I would believe, want, or do “upon reflection.” That is, in g…
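One minimal formalization of this idea, sketched with a hypothetical `deliberate` function standing in for one step of the perfect model: treat “upon reflection” as the long-run (ideally fixed-point) output of iterating deliberation.

```python
# Sketch only: given a perfect model of one step of the agent's
# deliberation, define "upon reflection" as the fixed point (or long-run
# limit) of iterating that step. `deliberate` is hypothetical.

def upon_reflection(deliberate, initial_state, max_steps=10**6):
    """Iterate one step of deliberation until the state stabilizes."""
    state = initial_state
    for _ in range(max_steps):
        next_state = deliberate(state)
        if next_state == state:   # reached a reflective fixed point
            return state
        state = next_state
    return state                  # fall back to the long-run state
```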
Suppose I want to provide a completely precise specification of “me,” or rather of the input/output behavior that I implement. How can I do this? I might be interested in this problem, …
Will machine intelligences communicate with humans by directly exposing or reporting properties of their internal state, or will they tend to communicate by strategically choosing utterances that t…
Suppose that we use the universal prior for sequence prediction, without regard for computational complexity. I think that the result is going to be really weird, and that most people don’t a…
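To make the setup concrete: the true universal prior mixes over all programs with weight $2^{-\text{length}}$ and is uncomputable, so the sketch below uses only “repeat this pattern forever” programs as a computable stand-in; that is enough to show how conditioning on the observed prefix drives the prediction.

```python
# Toy stand-in for universal-prior sequence prediction. "Programs" are
# periodic bit patterns; a pattern of period p gets weight 2^(-2p)
# (roughly p bits for the pattern plus p bits of overhead, which also
# keeps the total prior mass finite).

from itertools import product

def predict_next(prefix, max_period=8):
    weights = {0: 0.0, 1: 0.0}
    for period in range(1, max_period + 1):
        for pattern in product([0, 1], repeat=period):
            # Run the "program" for one step past the observed data.
            generated = [pattern[i % period] for i in range(len(prefix) + 1)]
            if generated[:len(prefix)] == list(prefix):  # agrees with the data
                weights[generated[-1]] += 2.0 ** (-2 * period)
    total = weights[0] + weights[1]
    return {bit: w / total for bit, w in weights.items()}

print(predict_next([0, 1, 0, 1, 0]))  # strongly favors 1 as the next bit
```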