Kylan Gibbs, CEO of Inworld, joins the show to discuss the technical challenges of creating interactive AI for virtual worlds and games, the significance of user experience, and the importance of accessibility and cost-efficiency in deploying AI models.| Stack Overflow Blog
I was going over the code base of mini-swe-agent today. The core agent loop is 100 lines long. All agentic framework does something similar. Interesting facts about mini-swe-agent: The Mini-SWE-Agent operates in a continuous loop, iteratively solving problems by querying an LLM for actions, executing bash commands, and observing results until the task is complete. … Continue reading "Notes on mini-swe-agent"| Shekhar Gulati
Any time I share my collection of tools built using vibe coding and AI-assisted development (now at 124, here's the definitive list) someone will inevitably complain that they're mostly trivial. A lot of them are! Here's a list of some that I think are genuinely useful and worth highlighting: OCR PDFs and images directly in your browser. This is the tool that started the collection, and I still use it on a regular basis. You can open any PDF in it (even PDFs that are just scanned images with ...| Simon Willison's Weblog
Beyond Vibe Coding Back in May I wrote Two publishers and three authors fail to understand what “vibe coding” means where I called out the authors of two forthcoming books on "vibe coding" for abusing that term to refer to all forms of AI-assisted development, when Not all AI-assisted programming is vibe coding based on the original Karpathy definition.I'll be honest: I don't feel great about that post. I made an example of those two books to push my own agenda of encouraging "vibe coding...| Simon Willison's Weblog
gov.uscourts.dcd.223205.1436.0_1.pdf Here's the 230 page PDF ruling on the 2023 United States v. Google LLC federal antitrust case - the case that could have resulted in Google selling off Chrome and cutting most of Mozilla's funding.I made it through the first dozen pages - it's actually quite readable. It opens with a clear summary of the case so far, bold highlights mine: Last year, this court ruled that Defendant Google LLC had violated Section 2 of the Sherman Act: “Google is a monopol...| Simon Willison's Weblog
Rich Pixels Neat Python library by Darren Burns adding pixel image support to the Rich terminal library, using tricks to render an image using full or half-height colored blocks.Here's the key trick - it renders Unicode ▄ (U+2584, "lower half block") characters after setting a foreground and background color for the two pixels it needs to display. I got GPT-5 to vibe code up a show_image.py terminal command which resizes the provided image to fit the width and height of the current terminal...| Simon Willison's Weblog
Introducing gpt-realtime Released a few days ago (August 28th), gpt-realtime is OpenAI's new "most advanced speech-to-speech model". It looks like this is a replacement for the older gpt-4o-realtime-preview model that was released last October.This is a slightly confusing release. The previous realtime model was clearly described as a variant of GPT-4o, sharing the same October 2023 training cut-off date as that model. I had expected that gpt-realtime might be a GPT-5 relative, but its traini...| Simon Willison's Weblog
Cloudflare Radar: AI Insights Cloudflare launched this dashboard back in February, incorporating traffic analysis from Cloudflare's network along with insights from their popular 1.1.1.1 DNS service.I found this chart particularly interesting, showing which documented AI crawlers are most active collecting training data - lead by GPTBot, ClaudeBot and Meta-ExternalAgent: Cloudflare's DNS data also hints at the popularity of different services. ChatGPT holds the first place, which is unsurpris...| Simon Willison's Weblog
Claude Opus 4.1 and Opus 4 degraded quality Notable because often when people complain of degraded model quality it turns out to be unfounded - Anthropic in the past have emphasized that they don't change the model weights after releasing them without changing the version number.In this case a botched upgrade of their inference stack cause a genuine model degradation for 56.5 hours: From 17:30 UTC on Aug 25th to 02:00 UTC on Aug 28th, Claude Opus 4.1 experienced a degradation in quality for s...| Simon Willison's Weblog
LLMs are intelligence without agency—what we might call "vox sine persona": voice without person. Not the voice of someone, not even the collective voice of many someones, but a voice emanating from no one at all. — Benj Edwards Tags: benj-edwards, ai-personality, generative-ai, ai, llms| Simon Willison's Weblog
The perils of vibe coding I was interviewed by Elaine Moore for this opinion piece in the Financial Times, which ended up in the print edition of the paper too! I picked up a copy yesterday: From the article, with links added by me to relevant projects: Willison thinks the best way to see what a new model can do is to ask for something unusual. He likes to request an SVG (an image made out of lines described with code) of a pelican on a bike and asks it to remember the chickens in his garden ...| Simon Willison's Weblog
We simply don’t know to defend against these attacks. We have zero agentic AI systems that are secure against these attacks. Any AI that is working in an adversarial environment—and …| Simon Willison’s Weblog
Piloting Claude for Chrome Two days ago I said:I strongly expect that the entire concept of an agentic browser extension is fatally flawed and cannot be built safely. Today Anthropic announced their own take on this pattern, implemented as an invite-only preview Chrome extension. To their credit, the majority of the blog post and accompanying support article is information about the security risks. From their post: Just as people encounter phishing attempts in their inboxes, browser-using AIs...| Simon Willison's Weblog
Will Smith’s concert crowds are real, but AI is blurring the lines Great piece from Andy Baio demonstrating quite how convoluted the usage ethics and backlash against generative AI has become.Will Smith has been accused of using AI to misleadingly inflate the audience sizes of his recent tour. It looks like the audiences were real, but the combined usage of static-image-to-video models by his team with YouTube's ugly new compression experiments gave the resulting footage an uncanny valley e...| Simon Willison's Weblog
Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet The security team from Brave took a look at Comet, the LLM-powered "agentic browser" extension from Perplexity, and unsurprisingly found security holes you can drive a truck through.The vulnerability we’re discussing in this post lies in how Comet processes webpage content: when users ask it to “Summarize this webpage,” Comet feeds a part of the webpage directly to its LLM without distinguishing between the user’s...| Simon Willison's Weblog
ChatGPT release notes: Project-only memory The feature I've most wanted from ChatGPT's memory feature (the newer version of memory that automatically includes relevant details from summarized prior conversations) just landed:With project-only memory enabled, ChatGPT can use other conversations in that project for additional context, and won’t use your saved memories from outside the project to shape responses. Additionally, it won’t carry anything from the project into future chats ou...| Simon Willison's Weblog
DeepSeek 3.1 The latest model from DeepSeek, a 685B monster (like DeepSeek v3 before it) but this time it's a hybrid reasoning model.DeepSeek claim: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly. Drew Breunig points out that their benchmarks show "the same scores with 25-50% fewer tokens" - at least across AIME 2025 and GPQA Diamond and LiveCodeBench. The DeepSeek release includes prompt examples for a coding agent, a python agent an...| Simon Willison's Weblog
too many model context protocol servers and LLM allocations on the dance floor Useful reminder from Geoffrey Huntley of the infrequently discussed significant token cost of using MCP.Geoffrey estimate estimates that the usable context window something like Amp or Cursor is around 176,000 tokens - Claude 4's 200,000 minus around 24,000 for the system prompt for those tools. Adding just the popular GitHub MCP defines 93 additional tools and swallows another 55,000 of those valuable tokens! MCP ...| Simon Willison's Weblog
Most classical engineering fields deal with probabilistic system components all of the time. In fact I'd go as far as to say that inability to deal with probabilistic components is disqualifying from many engineering endeavors. Process engineers for example have to account for human error rates. On a given production line with humans in a loop, the operators will sometimes screw up. Designing systems to detect these errors (which are highly probabilistic!), mitigate them, and reduce the occur...| Simon Willison's Weblog
I was at a leadership group and people were telling me "We think that with AI we can replace all of our junior people in our company." I was like, "That's the dumbest thing I've ever heard. They're probably the least expensive employees you have, they're the most leaned into your AI tools, and how's that going to work when you go 10 years in the future and you have no one that has built up or learned anything? — Matt Garman, CEO, Amazon Web Services Tags: ai-ethics, careers, generative-ai, ...| Simon Willison's Weblog
what’s the point of vibe coding if at the end of the day i still gotta pay a dev to look at the code anyway. sure it feels kinda cool …| Simon Willison’s Weblog
If I asked you to guess the job title of someone coding an app for work, your first guess probably wouldn’t be “writer”. It probably wouldn’t be your second or fifth guess either.| stackoverflow.blog
Today I was going over a paper by Microsoft Research team on how AI is impacting professsional work. This paper was published in July 2025. They analyzed 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot to understand how generative AI impacts different occupations and work activities. They seperated analysis into two distinct … Continue reading "Paper: Working with AI: Measuring the Occupational Implications of Generative AI"| Shekhar Gulati
Google recently released Gemma 3 270M, a remarkably compact 270 million parameter language model that promises efficient AI capabilities in a tiny package. As someone building AI voice agents, I was immediately interested in testing whether this model could handle one of my simplest but frequent use cases: generating message variations for conversational AI. For … Continue reading "I Tested Gemma 3 270M on the Simplest NLP Task"| Shekhar Gulati
Today, I was browsing Hacker News when I stumbled upon an interesting project: coderunner-ui. The premise was compelling – a local-first AI workspace that lets you chat with LLMs and execute …| Shekhar Gulati
Fun, creative new micro-eval. Split the world into a sampled collection of latitude longitude points and for each one ask a model: If this location is over land, say 'Land'. …| Simon Willison’s Weblog
I shipped LLM 0.27 today (followed by a 0.27.1 with minor bug fixes), adding support for the new GPT-5 family of models from OpenAI plus a flurry of improvements to …| Simon Willison’s Weblog
I gave a talk on Wednesday at the Bay Area AI Security Meetup about prompt injection, the lethal trifecta and the challenges of securing systems that use MCP. It wasn’t …| Simon Willison’s Weblog
I’ve been dipping into the r/ChatGPT subreddit recently to see how people are reacting to the GPT-5 launch, and so far the vibes there are not good. This AMA thread …| Simon Willison’s Weblog
I’ve had preview access to the new GPT-5 model family for the past two weeks (see related video and my disclosures) and have been using GPT-5 as my daily-driver. It’s …| Simon Willison’s Weblog
In this episode of Leaders of Code, Jody Bailey, Stack Overflow’s CTPO, Anirudh Kaul, Senior Director of Software Engineering, and Paul Petersen, Cloud Platform Engineering Manager, discuss the U.S. Bank’s journey from traditional banking practices to embracing new technologies.| Stack Overflow Blog
Ryan welcomes Mahir Yavuz, Senior Director of Engineering at Etsy, to the show to explore the unique challenges that Etsy’s marketplace faces and how Etsy’s teams leverage machine learning and AI to manage product SKUs, enrich inventory metadata, and improve both buyer and seller experiences.| Stack Overflow Blog
I have spent last few months working on a regulatory intelligence software. One of the important feature is extracting obligations from dense PDF documents. In this post I am sharing some of the le…| Shekhar Gulati
Today I was reading OpenAI guide on model selection https://platform.openai.com/docs/guides/model-selection where they explained how to calculate a reaslistic accuracy target for LLM task by evaluating financial impact of model decisions. They gave an example of fake news classifier. This is a good way to find the accuracy you need for the task. Break-even accuracy is … Continue reading "Setting a realistic accuracy target for LLM tasks"| Shekhar Gulati
Cursor, the AI-powered code editor that has transformed how developers write code, recently underwent a significant pricing overhaul that has sparked intense debate in the developer community. The …| Shekhar Gulati
Ryan is joined by Kieran Furlong, CEO of Realta Fusion, to talk about the future of fusion as a safe and sustainable energy source, the computation and scientific advancements that have made fusion possible, and how fusion technology innovations will address data and AI’s rising energy demands.| Stack Overflow Blog
In the last blog I discussed how I use OpenAI Code Interpreter to do RAG over data (CSV, Excel, etc.) files. OpenAI Code Interpreter is a managed offering and it does have some limitations. So, I was looking for an open source alternative. I discovered Pydantic team’s MCP Run Python package. It is an MCP … Continue reading "Using Pydantic MCP Run Python as an Open Source Alternative to OpenAI Code Interpreter"| Shekhar Gulati
While large language models (LLMs) have achieved remarkable capabilities in processing long contexts and locating specific information, a recent paper reveals a surprising blind spot: they struggle…| Shekhar Gulati
When building RAG systems, one common challenge is helping users query their own data. Users often come with a couple of Excel files, Word documents, or CSV files and want to ask questions like …| Shekhar Gulati
Solomon Hykes just presented the best definition of an AI agent I've seen yet, on stage at the AI Engineer World's Fair: An AI agent is an LLM wrecking its …| Simon Willison’s Weblog
Hey there! It's good to be back on the blog. Over the past few months, I've been focused on setting up the foundations for A New Social. I couldn't have imagined this is where I'd end up after writing my Bridges & The Last Network Effect post, but here we are!| augment
One term that I have been hearing a lot lately is reward hacking. I have heard this term multiple times from folks at OpenAI and Anthropic, and it represents a fundamental challenge in AI alignment…| Shekhar Gulati
Mistral released a new model yesterday. It is designed to excel at Agentic coding tasks meaning it can use tools. It is Apache 2.0 license. It is finetuned from Mistral-Small-3.1, therefore it has …| Shekhar Gulati
Big upgrade to Mistral's API this morning: they've announced a new "Agents API". Mistral have been using the term "agents" for a while now. Here's how they describe them: AI …| Simon Willison’s Weblog
I was going slightly spare at the fact that every talk at this Anthropic developer conference has used the word "agents" dozens of times, but nobody ever stopped to provide …| Simon Willison’s Weblog
Classic slop: it listed real authors with entirely fake books. There's an important follow-up from 404 Media in their subsequent story: Victor Lim, the vice president of marketing and communications …| Simon Willison’s Weblog
Relatively thin post from OpenAI talking about their recent rollback of the GPT-4o model that made the model way too sycophantic - "overly flattering or agreeable", to use OpenAIs own …| Simon Willison’s Weblog
A practical guide for trial lawyers who want to try out AI LLMs (ChatGPT-4) in their practice and including simple-to-follow instructions and prompt examples.| Ball in your Court
Last night, I found myself overwhelmed by open tabs in Chrome. I wondered how many I had open, but couldn’t find a built-in tab counter. While third-party extensions likely existed, I am not …| Shekhar Gulati
In my previous post we built Prompt Injection Detector by training a LogisticRegression classifier on embeddings of SPML Chatbot Prompt Injection Dataset. Today, we will look at how we can fine-tun…| Shekhar Gulati
In the last couple of days, I’ve spent some hours playing with Patchwork. Patchwork is an open-source framework that leverages AI to accelerate asynchronous development tasks like code review…| Shekhar Gulati
Today I was watching a talk by Maggie Appleton from local-first conference. She points out in her insightful talk on homecooked software and barefoot developers, there exists a significant gap in a…| Shekhar Gulati
I was reading Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost paper today and thought of applying it to a problem I solved a couple of months back. This paper introduced the Con…| Shekhar Gulati
Today I was reading Chapter 9 “Multimodal Large Language Models” of Hands-On Large Language Models book and thought of applying it to a problem I face occassionally. The chapter covers …| Shekhar Gulati
I enjoy reading books on Oreilly learning platform . For the past month, a new feature on the Oreilly platform called “Answers” has been staring me down, and I haven’t been tempte…| Shekhar Gulati
Today, I want to expand on a topic I discussed in issue #2: publishers striking deals with AI companies and what that means for their futures and the publisher landscape as a whole.| augment
"My goal for the next issue is to not talk about the Fediverse." That was me in the last issue of Human-Generated Content and I would like to start by apologizing for this very predictable lie. Hello, again! Last time, we talked about the diverging strategies between publishers choosing AI| augment
Publishers are seeing two very different futures for their businesses. Is the future of media aggregated and summarized or is it direct-to-audience?| augment
This post is my thought after working in the GenAI startup space a bit and observing many peers in the space. Building a successful startup (and its products) is always hard, but I feel that building in GenAI space with a small team and budget may be actually harder than the average, in contrast to […]| piaoyang
Swing dancing and prompt engineering are pretty different. But could learning one help us learn the other?| alexwlchan.net
Reader’s Digest, the century-old magazine with the highest paid circulation, has long published “condensed” books; anthologies of four-to-five popular novels abridged to fit in a single volume.&nbs…| Ball in your Court