My life is not a math Olympiad| Graham King
TIL: Running a gpt-oss eval suite against LM Studio on a Mac. The other day I learned that OpenAI published a set of evals as part of their gpt-oss model release, described in their cookbook on Verifying gpt-oss implementations. I decided to try and run that eval suite on my own MacBook Pro, against gpt-oss-20b running inside of LM Studio. TLDR: once I had the model running inside LM Studio with a longer-than-default context limit, the following incantation ran an eval suite in around 3.5 hours...| Simon Willison's Weblog
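LM Studio exposes an OpenAI-compatible API on localhost (http://localhost:1234/v1 by default), which is what lets an eval harness target a locally hosted gpt-oss-20b. Below is a minimal sketch of talking to that endpoint with the openai Python client; it assumes the default port and that the model identifier matches what LM Studio reports, and it is not the incantation from the post:

```python
# Hedged sketch: the general shape of pointing an OpenAI-compatible client at
# LM Studio's local server, not the eval command from the post.
# Assumes LM Studio is serving gpt-oss-20b at its default address.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's local OpenAI-compatible endpoint
    api_key="lm-studio",                  # LM Studio ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",           # model identifier as loaded in LM Studio (assumption)
    messages=[{"role": "user", "content": "Say 'ready' if you can hear me."}],
)
print(response.choices[0].message.content)
```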
Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model - OpenAI’s gpt-oss-120b - performs across different hosted providers. The results showed some surprising differences. Here's the one with the greatest variance, a run of the 2025 AIME (American Invitational Mathematics Examination) averaging 32 runs against each model, using gpt-oss-120b with a reasoning effort of "high": These are some varied results! 93.3%: Cerebras, Nebius Base, Fir...| Simon Willison's Weblog
Fun, creative new micro-eval. Split the world into a sampled collection of latitude/longitude points and for each one ask a model: If this location is over land, say 'Land'. …| Simon Willison’s Weblog
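The excerpt does not include code, but the described procedure is easy to sketch: sample latitude/longitude points, prompt the model for a one-word Land/Water answer, and score against a ground-truth lookup. In this sketch, ask_model and is_land_truth are hypothetical placeholders, not anything from the post:

```python
# Hedged sketch of the Land/Water micro-eval described above: sample points on
# the globe, ask a model to classify each one, and score against an oracle.
import random

def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to whatever model you're evaluating."""
    raise NotImplementedError

def is_land_truth(lat: float, lon: float) -> bool:
    """Placeholder: a ground-truth land/water lookup (e.g. from a geo dataset)."""
    raise NotImplementedError

def sample_points(n: int) -> list[tuple[float, float]]:
    # Sample latitude/longitude pairs across the globe.
    return [(random.uniform(-90, 90), random.uniform(-180, 180)) for _ in range(n)]

def run_eval(n: int = 100) -> float:
    correct = 0
    for lat, lon in sample_points(n):
        prompt = (
            f"Location: {lat:.4f}, {lon:.4f}. "
            "If this location is over land, say 'Land'. "
            "If it is over water, say 'Water'. Answer with one word."
        )
        answer = ask_model(prompt).strip().lower()
        expected = "land" if is_land_truth(lat, lon) else "water"
        correct += answer == expected
    return correct / n  # fraction of points classified correctly
```

One design note: sampling latitude and longitude uniformly over-represents the polar regions; a sampler that is uniform over the sphere would weight latitudes by the cosine of latitude.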
Hello friends. This blog post was supposed to be the second part of this re| evilsocket
A well-built custom eval lets you quickly test the newest models, iterate faster when developing prompts and pipelines, and ensure you’re always moving forward against your product’s specific goal. Let’s build an example eval – made from Jeopardy questions – to illustrate the value of a custom eval.| Drew Breunig
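A Jeopardy-style custom eval can be sketched in a few lines: feed question/answer pairs to the system under test and report a simple match rate. The questions and the ask_model helper below are illustrative placeholders, not taken from the post:

```python
# Hedged sketch of a tiny Jeopardy-style custom eval, not the one from the post:
# run question/answer pairs through a model and report a lenient match rate.
QUESTIONS = [
    {"question": "This planet is known as the Red Planet.", "answer": "mars"},
    {"question": "This element has the chemical symbol 'O'.", "answer": "oxygen"},
]

def ask_model(prompt: str) -> str:
    """Placeholder: call whichever model or pipeline you're evaluating."""
    raise NotImplementedError

def score(items: list[dict]) -> float:
    correct = 0
    for item in items:
        prompt = (
            "Answer the following question with a single word or short phrase.\n"
            f"Question: {item['question']}"
        )
        guess = ask_model(prompt).strip().lower()
        correct += item["answer"] in guess  # lenient containment match
    return correct / len(items)
```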
Earlier this year, I wrote Your AI product needs evals. Many of you asked, “How do I get started with LLM-as-a-judge?” This guide shares what I’ve learned after helping over 30 companies set up their evaluation systems. The Problem: AI Teams Are Drowning in Data. Ever spend weeks building an AI system, only to realize you have no idea if it’s actually working? You’re not alone. I’ve noticed teams repeat the same mistakes when using LLMs to evaluate AI outputs: Too Many Metrics: Cre...| Hamel's Blog