TLDR Top language models were thrown into the board game Diplomacy and forced to negotiate, ally, and betray. OpenAI’s 03 won by secretly forming coalitions and then knifing its friends. Gemini 2.5 Pro fought well but fell to a coordinated backstab. Claude tried to stay honest and paid the price. The open-source benchmark reveals which […]