DeepSeek-R1-Zero is cool. I wrote about reasoning models before o1, and I’m excited by the way this area of research seems to have been cracked wide open. The approach is also remarkably simple. I’m messing around with llama (running locally!) to see if I can at least partially reproduce the results, for fun. I figure I can collect reasoning chains and then adapt some existing RLHF code to fine-tune the model on successful chains vs. unsuccessful chains, maybe by prefixing the response wi...
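
As a rough illustration of the "collect chains, then separate successful from unsuccessful" step, here is a minimal Python sketch. Everything in it is an assumption for illustration, not the actual setup: it supposes each sampled chain ends with a line like `Answer: 42`, that a known correct answer exists for each prompt, and that the downstream RLHF-style code accepts `chosen`/`rejected` pairs. The names `extract_answer`, `label_chains`, and `to_preference_pairs` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a chain states its final answer on a line like "Answer: 42".
ANSWER_RE = re.compile(r"Answer:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


@dataclass
class LabeledChain:
    prompt: str
    chain: str
    success: bool


def extract_answer(chain: str) -> Optional[str]:
    """Return the last 'Answer: ...' value in the chain, or None if absent."""
    matches = ANSWER_RE.findall(chain)
    return matches[-1].strip() if matches else None


def label_chains(prompt: str, chains: List[str], expected: str) -> List[LabeledChain]:
    """Mark each sampled chain as successful if its final answer matches."""
    target = expected.strip()
    return [LabeledChain(prompt, c, extract_answer(c) == target) for c in chains]


def to_preference_pairs(labeled: List[LabeledChain]) -> List[dict]:
    """Pair every successful chain with every unsuccessful one (assumed format)."""
    good = [x for x in labeled if x.success]
    bad = [x for x in labeled if not x.success]
    return [
        {"prompt": g.prompt, "chosen": g.chain, "rejected": b.chain}
        for g in good
        for b in bad
    ]


# Hand-written stand-ins for chains sampled from a local model.
chains = ["2+2 is 4.\nAnswer: 4", "2+2 is 5.\nAnswer: 5", "no idea"]
labeled = label_chains("What is 2+2?", chains, "4")
pairs = to_preference_pairs(labeled)
```

Exact string matching on the answer is the simplest possible success check; a real attempt would likely need normalization (whitespace, number formats) or a task-specific verifier.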