We now know that “LLMs + simple RL” is working really well. So, we’re well on track for superhuman performance in domains where we have many examples of good reasoning and good outcomes to train on. We have or can get such datasets for surprisingly many domains. We may also lack them in surprisingly many. How well can our models generalise to “fill in the gaps”? It’s an empirical question, not yet settled. But Dario and Sam clearly think “very well”. Dario, in particular, is sayi...