GPT-5 and GPT-5 Thinking are large language models recently released by OpenAI, after a long series of announcements and hype. Results on benchmarks are impressive. How good are these reasoning models at chess? Using a simple four-move sequence, I succeeded in forcing GPT-5 and GPT-5 Thinking into an illegal move. Basically like GPT-3.5, GPT-4, DeepSeek-R1, o4-mini, and o3 (see all my posts). There are other concerning insights… Though it is a very specific example, it is not a good sign.| Mathieu Acher
A short pause in the summer break… for a presentation at the 2025 ACM Conference on Reproducibility and Replicability (https://acm-rep.github.io/2025/), a super important but overlooked topic in science! I presented the paper entitled “Teaching Reproducibility and Embracing Variability: From Floating-Point Experiments to Replicating Research”| Mathieu Acher
o3 and o4-mini are large language models recently released by OpenAI and augmented with chain-of-thought reinforcement learning, designed to “think before they speak” by generating explicit, multi-step reasoning before producing an answer. How good are these reasoning models at chess? Using a simple four-move sequence, I succeeded in forcing o3 into an illegal move, and across multiple matches both o3 and o4-mini struggle dramatically, generating illegal moves in over 90% of cases and even...| blog.mathieuacher.com
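To make the "illegal move" claim concrete, here is a minimal sketch of how such a check can be done with python-chess; it is not the exact harness used in these posts, and `ask_model_for_move` stands in for whatever API call returns the model's reply in SAN.

```python
import chess

def is_reply_legal(moves_san, reply_san):
    """Replay a game prefix and check whether the model's SAN reply is legal.

    moves_san: moves already played, in SAN (e.g. ["e4", "e5", "Qh5"]).
    reply_san: the move proposed by the model, in SAN (e.g. "Ke7").
    """
    board = chess.Board()
    for san in moves_san:
        board.push_san(san)  # raises ValueError if the prefix itself is broken
    try:
        board.parse_san(reply_san)  # raises a ValueError subclass if illegal/unparsable
        return True
    except ValueError:
        return False

# Example: after 1.e4 e5 2.Qh5?! the reply "Ke7" is legal (though dubious),
# while "O-O" is simply illegal in that position.
print(is_reply_legal(["e4", "e5", "Qh5"], "Ke7"))  # True
print(is_reply_legal(["e4", "e5", "Qh5"], "O-O"))  # False
```

The illegal-move rate reported in the post is then just the fraction of model replies for which such a check fails, accumulated over many games.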
A study by Palisade Research found that advanced AI models (most notably OpenAI’s o1-preview or Claude Sonnet 3.5 from Anthropic) sometimes “cheat” in chess by hacking their opponent’s system files rather than playing by the rules. While older AI models required explicit prompting to cheat, the most recent agents seem capable of discovering and exploiting cybersecurity holes, raising concerns that AI systems might develop manipulative strategies and be uncontrollable for complex tasks...| Mathieu Acher
I have come to the conclusion that DeepSeek-R1 is worse than a 5-year-old version of GPT-2 at chess… The very recent, state-of-the-art, open-weights model DeepSeek R1 is making the 2025 headlines, excellent on many benchmarks, with a new integrated, end-to-end reinforcement learning approach to large language model (LLM) training. I am personally very excited about this model, and I’ve been working on it over the last few days, confirming that DeepSeek R1 is on par with GPT-o for several tasks. Yet, ...| blog.mathieuacher.com
I recently watched a great interview of the mathematician and Fields medalist (2022) Hugo Duminil-Copin by Science étonnante (aka David Louapre). At some point, there was an interesting discussion on the role of AI in the discovery of new mathematical findings. I think the arguments generalize far beyond mathematics: AI as a creative sparring partner, AI as a way to have an inner, interactive dialogue with oneself, AI as a tool to automate the boring parts of the process, human fallibility a...| blog.mathieuacher.com
I recently watched a great talk by Peter van Hardenberg (aka pvh) titled “Why Can’t We Make Simple Software?”. The talk dives into the deep-rooted reasons behind the multiple kinds of complexity in software systems — even the seemingly simple ones. A brief summary and some thoughts in this post.| blog.mathieuacher.com
This is not just another blog post about chess—or at least, not only about chess. While the setting involves Stockfish, the world’s strongest open-source chess engine, the real discussion here is broader: How do we carefully assess inconsistencies in complex AI systems? When an AI model—or a highly optimized program—seems to violate fundamental expectations, how do we tell the difference between a genuine bug and an artifact of the evaluation setup? These questions brought me to the p...| Mathieu Acher
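As a rough illustration of the kind of consistency check that post discusses (assuming a local Stockfish binary on the PATH and the python-chess engine API, not the exact protocol of the post), one can evaluate the same position under two search limits and compare the reported scores before crying "bug":

```python
import chess
import chess.engine

# Assumes a Stockfish binary is reachable as "stockfish" on the PATH.
FEN = "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
board = chess.Board(FEN)

shallow = engine.analyse(board, chess.engine.Limit(depth=10))
deep = engine.analyse(board, chess.engine.Limit(depth=25))

# Scores are reported from the side to move; convert to centipawns for comparison.
cp_shallow = shallow["score"].relative.score(mate_score=100000)
cp_deep = deep["score"].relative.score(mate_score=100000)
print(f"depth 10: {cp_shallow} cp, depth 25: {cp_deep} cp")

# Swings between settings are not necessarily bugs: search is non-deterministic
# under multi-threading and sensitive to hash size, depth, and other options.
engine.quit()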
I came across a video by the amazing Laurie Wired about the art of fitting a quite impressive cinematic experience into 256 bytes. Yes, 256 bytes. I’m usually interested in code minification/golfing (see, e.g., a chess puzzle resolver fitting in a Tweet here: https://blog.mathieuacher.com/ProgrammingChessPuzzles/). I struggled a bit to find the mentioned code, and to make it work. The basic idea is to use a language with a Rust-like syntax and, with a few lines of code, compile into WASM (Web ...| blog.mathieuacher.com
I found a way to systematically force a win in 4 moves against the latest release from OpenAI (ChatGPT-4o) and in 7 moves against the best GPT at chess, gpt-3.5-turbo-instruct, estimated at 1750 Elo. Clickbait: you don’t need to master chess to beat GPTs, you can very well be 700 Elo, just follow the moves ;-) The story behind the discovery is interesting and worth sharing. I also discuss the generalization of the results and the implications for the future of GPTs and chess. Don’t fear the clickbai...| blog.mathieuacher.com
Shawn Presser has released an intriguing chess engine based on a deep-learning language model (GPT-2). The model was trained on the Kingbase dataset (3.5 million chess games in PGN notation) in 24 hours using 146 TPUs (ouch!). The engine is purely based on text prediction, with no concept of chess. Though GPT-2 has already delivered promising, even stunning results for text generation, one can be skeptical and wonder whether it actually works for chess.| blog.mathieuacher.com
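As a rough illustration of the text-prediction idea (not Shawn Presser's actual model; the stock GPT-2 from Hugging Face is used here as a stand-in, whereas the real engine was fine-tuned on Kingbase PGN data), one can feed a PGN movetext prefix, let the model continue it, and check whether the continuation is even a legal move:

```python
import chess
from transformers import pipeline

# Stock GPT-2 as a stand-in for the PGN-fine-tuned model.
generator = pipeline("text-generation", model="gpt2")

pgn_prefix = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."
output = generator(pgn_prefix, max_new_tokens=8, do_sample=False)[0]["generated_text"]
continuation = output[len(pgn_prefix):].split()
candidate = continuation[0] if continuation else ""

# Replay the prefix and check whether the predicted token is a legal SAN move.
board = chess.Board()
for san in ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6"]:
    board.push_san(san)
try:
    board.parse_san(candidate)
    print(f"predicted move {candidate!r} is legal")
except ValueError:
    print(f"predicted move {candidate!r} is not a legal move")
```

With the untuned model the prediction is usually nonsense; the whole question of the post is how far fine-tuning on millions of games changes that.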
Can GPTs like ChatGPT-4 play legal moves and finish chess games? What is the actual Elo rating of GPTs? There has been some hype, (subjective) assessment, and buzz lately, from “GPT is capable of beating 99% of players?” to “GPT plays lots of illegal moves” to “here is a magic prompt with Magnus Carlsen in the headers”. There are more or less solid anecdotes here and there, with counter-examples showing impressive failures or magnified stories on how well GPTs can play chess. I...| blog.mathieuacher.com
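Estimating an Elo rating from game outcomes is, at its core, simple arithmetic. Here is a minimal sketch of the standard logistic expected-score formula and a naive performance-rating estimate; it is illustrative only, not the methodology used in the post.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def performance_rating(opponent_ratings, score, lo=0.0, hi=4000.0):
    """Naive performance rating: the rating whose expected total score against
    the given opponents matches the observed score (found by bisection)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        expected = sum(expected_score(mid, r) for r in opponent_ratings)
        if expected < score:
            lo = mid
        else:
            hi = mid
    return round((lo + hi) / 2.0)

# Example: scoring 6.5/10 against ten 1800-rated opponents.
print(expected_score(1750, 1800))            # ~0.43
print(performance_rating([1800] * 10, 6.5))  # ~1907
```

Of course, the hard part for GPTs is not the arithmetic but getting enough complete, legal games against rated opposition to make such an estimate meaningful.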