o3 and o4-mini are large language models recently released by OpenAI and augmented with chain-of-thought reinforcement learning, designed to “think before they speak” by generating explicit, multi-step reasoning before producing an answer. How good are these reasoning models at chess? Using a simple four-move sequence, I manage to force o3 into an illegal move, and across multiple matches both o3 and o4-mini struggle dramatically, generating illegal moves in over 90% of cases and even...| blog.mathieuacher.com
I have come to the conclusion that DeepSeek-R1 is worse at chess than a five-year-old version of GPT-2… The very recent, state-of-the-art, open-weights model DeepSeek R1 is making headlines in 2025: it excels on many benchmarks and brings a new integrated, end-to-end reinforcement learning approach to large language model (LLM) training. I am personally very excited about this model, and I’ve been working on it over the last few days, confirming that DeepSeek R1 is on par with GPT-o for several tasks. Yet, ...| blog.mathieuacher.com
I find a way to systematically force a win in 4 moves against OpenAI’s latest release (ChatGPT-4o) and in 7 moves against the best GPT at chess, gpt-3.5-turbo-instruct, estimated at 1750 Elo. Clickbait: you don’t need to master chess to beat GPTs; you could well be rated 700 Elo and just follow the moves ;-) The story behind the discovery is interesting and worth sharing. I also discuss how well the results generalize and what they imply for the future of GPTs and chess. Don’t fear the clickbai...| blog.mathieuacher.com
(“make LLMs play better with one weird trick”)| DYNOMIGHT
Oracle is the first chess engine that plays like a human, from amateur to super GM.| Yosha Chess