Giles Thomas's blog: Practical insights on AI, startups, and software development, drawn from 30 years of building technology and 20 years of blogging.| www.gilesthomas.com
To start training our LLM we need a loss function -- cross-entropy loss. What is it, and why does it work?| Giles' Blog
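As a taster for that post, here is a minimal sketch (my own, assuming PyTorch; not the post's code) of what cross-entropy loss does: it takes the negative log of the probability the model assigned to each token that actually came next, averaged over positions.

```python
import torch
import torch.nn.functional as F

# Raw, unnormalised scores (logits) over a 6-token vocabulary, for 2 positions.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1, 0.0, -0.5],
                       [0.2, 1.5,  0.3, 0.0, -2.0, 0.7]])
# The tokens that actually came next in the training data.
targets = torch.tensor([0, 1])

# cross_entropy applies log-softmax to the logits, then averages the negative
# log-probabilities assigned to the correct tokens.
loss = F.cross_entropy(logits, targets)

# The same thing done by hand, to show what the number means.
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(2), targets].mean()
print(loss.item(), manual.item())  # identical values
```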
My last post, about the maths you need to start understanding LLMs, took off on Hacker News over the weekend. It's always nice to see lots of people reading and -- I hope! -- enjoying something that you've written. But there's another benefit. If enough people read something, some of them will spot errors or confusing bits -- "given enough eyeballs, all bugs are shallow". Commenter bad_ash made the excellent point that in the phrasing I originally had, a naive reader might think that activati...| Giles' Blog
What actually goes on inside an LLM to make it calculate probabilities for the next token?| Giles' Blog
A quick refresher on the maths behind LLMs: vectors, matrices, projections, embeddings, logits and softmax.| Giles' Blog
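To illustrate the last two items on that list (a sketch of my own, assuming PyTorch; not the post's code): softmax turns a vector of logits into a probability distribution.

```python
import torch

# Logits: arbitrary real-valued scores, one per token in a tiny vocabulary.
logits = torch.tensor([2.0, 1.0, 0.1, -1.0])

# Softmax exponentiates and normalises, giving non-negative values that sum to 1.
probs = torch.softmax(logits, dim=-1)
print(probs)        # ≈ tensor([0.638, 0.235, 0.095, 0.032])
print(probs.sum())  # 1.0
```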
I'm getting towards the end of chapter 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". When I first read this chapter, it seemed to be about tricks to use to make LLMs trainable, but having gone through it more closely, only the first part -- on layer normalisation -- seems to fit into that category. The second, about the feed-forward network, is definitely not -- that's the part of the LLM that does a huge chunk of the thinking needed for next-token prediction. An...| Giles' Blog
I've now finished chapter 4 in Sebastian Raschka's book "Build a Large Language Model (from Scratch)", having worked through shortcut connections in my last post. The remainder of the chapter doesn't introduce any new concepts -- instead, it shows how to put all of the code we've worked through so far into a full GPT-type LLM. You can see my code here, in the file gpt.py -- though I strongly recommend that if you're also working through the book, you type it in yourself -- I found that even t...| Giles' Blog
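For anyone who hasn't reached that point in the book yet, here's a rough idea of how the pieces fit together. This is my own condensed sketch, not the gpt.py from the repo or the book's code; hyperparameter names and block internals are placeholders.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    # Deliberately compact: causal self-attention plus a feed-forward network,
    # each wrapped in a shortcut connection with pre-layer-norm.
    def __init__(self, emb_dim, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim), nn.GELU(), nn.Linear(4 * emb_dim, emb_dim)
        )

    def forward(self, x):
        t = x.size(1)
        # True above the diagonal = "don't attend to later tokens".
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=mask)
        x = x + attn_out
        return x + self.ff(self.norm2(x))

class MiniGPT(nn.Module):
    # Skeleton of a GPT-style model: token + position embeddings, a stack of
    # transformer blocks, a final layer norm, and a projection to vocab logits.
    def __init__(self, vocab_size, ctx_len, emb_dim, n_layers, n_heads):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(ctx_len, emb_dim)
        self.blocks = nn.ModuleList(TransformerBlock(emb_dim, n_heads) for _ in range(n_layers))
        self.final_norm = nn.LayerNorm(emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    def forward(self, token_ids):  # (batch, seq_len) of token IDs
        seq_len = token_ids.size(1)
        x = self.tok_emb(token_ids) + self.pos_emb(torch.arange(seq_len, device=token_ids.device))
        for block in self.blocks:
            x = block(x)
        return self.out_head(self.final_norm(x))  # (batch, seq_len, vocab_size) logits

model = MiniGPT(vocab_size=100, ctx_len=32, emb_dim=16, n_layers=2, n_heads=2)
print(model(torch.randint(0, 100, (1, 10))).shape)  # torch.Size([1, 10, 100])
```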
How AI chatbots like ChatGPT work under the hood -- the post I wish I’d found before starting 'Build a Large Language Model (from Scratch)'.| Giles' Blog
I'm still working through chapter 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This chapter not only puts together the pieces that the previous ones covered, but adds on a few extra steps. I'd previously been thinking of these steps as just useful engineering techniques ("folding, spindling and mutilating" the context vectors) to take a model that would work in theory, but not in practice, and make it something trainable and usable -- but in this post I'll expl...| Giles' Blog
The feed-forward network in an LLM processes context vectors one at a time. This feels like it would cause similar issues to the old fixed-length bottleneck, even though it almost certainly does not.| Giles' Blog
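To make "one at a time" concrete (a sketch of the general pattern, assuming PyTorch; not the book's exact module): the feed-forward network is applied to each position's context vector independently, so a (batch, seq_len, emb_dim) tensor goes in and the same shape comes out, with no mixing across the sequence.

```python
import torch
import torch.nn as nn

emb_dim = 8
ff = nn.Sequential(               # the usual expand -> nonlinearity -> contract shape
    nn.Linear(emb_dim, 4 * emb_dim),
    nn.GELU(),
    nn.Linear(4 * emb_dim, emb_dim),
)

x = torch.randn(2, 5, emb_dim)    # (batch, seq_len, emb_dim) of context vectors
out = ff(x)                       # same shape; each of the 5 positions is processed
print(out.shape)                  # independently of all the others
```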
Working through layer normalisation -- why do we do it, how does it work, and why doesn't it break everything?| Giles' Blog
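A tiny numerical illustration of the "how does it work" part (my sketch, using PyTorch's nn.LayerNorm): each context vector is rescaled to have mean 0 and variance 1 across its embedding dimension, then stretched and shifted by learnable scale and shift parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4, 6)                # (batch, seq_len, emb_dim)

ln = nn.LayerNorm(6)                    # normalises over the last (embedding) dimension
out = ln(x)

# Each individual context vector now has ~zero mean and ~unit variance...
print(out.mean(dim=-1))                 # ≈ 0 everywhere
print(out.var(dim=-1, unbiased=False))  # ≈ 1 everywhere
# ...before the learnable scale (ln.weight) and shift (ln.bias) are trained
# away from their initial values of 1 and 0.
```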
A couple of lessons learned in moving from Fabric3 to Fabric| Giles' Blog
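For context on that move (an illustrative sketch only -- the specific lessons in the post may well be different): Fabric3 kept Fabric 1's implicit global `env` and module-level `run`, while modern Fabric is built around explicit `Connection` objects. Host names and commands below are placeholders.

```python
# Fabric3 / Fabric 1 style (implicit global state):
#
#     from fabric.api import env, run
#     env.hosts = ["server.example.com"]
#     def deploy():
#         run("git pull")

# Modern Fabric (2+) style -- explicit connections:
from fabric import Connection

def deploy():
    c = Connection("server.example.com", user="deploy")
    c.run("git pull")
    c.sudo("systemctl restart myapp")
```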
The way we get from context vectors to next-word prediction turns out to be simpler than I imagined -- but understanding why it works took a bit of thought.| Giles' Blog
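Roughly, the "simpler than I imagined" part looks like this (my sketch, assuming PyTorch; not the post's code): the context vector for each token is multiplied by an output projection matrix to get one logit per vocabulary entry, and softmax over the last position's logits gives next-word probabilities.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 50257, 768
out_head = nn.Linear(emb_dim, vocab_size, bias=False)  # the output projection

context_vectors = torch.randn(1, 6, emb_dim)   # (batch, seq_len, emb_dim) from the LLM
logits = out_head(context_vectors)             # (1, 6, 50257): one score per vocab token

# For generation we only care about the last position's prediction:
next_token_probs = torch.softmax(logits[:, -1, :], dim=-1)
next_token = torch.argmax(next_token_probs, dim=-1)   # greedy choice of next token ID
print(next_token.shape)                        # torch.Size([1])
```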
After 14 years, it's time for me to move on -- but PythonAnywhere is in great hands and has a fantastic future ahead of it!| Giles' Blog
A pause to take stock: realising that attention heads are simpler than I thought explained why we do the calculations we do.| Giles' Blog
Finally getting to the end of chapter 3 of Raschka’s LLM book! This time it’s multi-head attention: what it is, how it works, and why the code does what it does.| Giles' Blog
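The key idea, in very rough outline (my own condensed sketch, not the book's MultiHeadAttention class): project to queries, keys and values once, reshape so each head gets its own slice of the output dimension, run attention per head in parallel, then glue the heads back together.

```python
import torch

def multi_head_attention(x, W_q, W_k, W_v, n_heads):
    """x: (batch, seq_len, d_in); W_q/W_k/W_v: (d_in, d_out), d_out divisible by n_heads.
    Causal masking is left out to keep the sketch short."""
    b, t, _ = x.shape
    d_out = W_q.shape[1]
    head_dim = d_out // n_heads

    # Project once, then split the output dimension into per-head slices:
    # (b, t, d_out) -> (b, n_heads, t, head_dim)
    q = (x @ W_q).view(b, t, n_heads, head_dim).transpose(1, 2)
    k = (x @ W_k).view(b, t, n_heads, head_dim).transpose(1, 2)
    v = (x @ W_v).view(b, t, n_heads, head_dim).transpose(1, 2)

    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (b, n_heads, t, t)
    weights = torch.softmax(scores, dim=-1)              # each head attends independently

    out = weights @ v                                    # (b, n_heads, t, head_dim)
    return out.transpose(1, 2).reshape(b, t, d_out)      # heads glued back together

x = torch.randn(2, 5, 16)                                # batch of 2, 5 tokens, dim 16
W_q, W_k, W_v = (torch.randn(16, 16) for _ in range(3))
print(multi_head_attention(x, W_q, W_k, W_v, n_heads=4).shape)  # torch.Size([2, 5, 16])
```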
Batching speeds up training and inference, but for LLMs we can't just use matrices for it -- we need higher-order tensors.| Giles' Blog
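A quick illustration of the point (a sketch of my own, not the post's example): with a batch of sequences, the activations are 3D tensors, and PyTorch's matmul broadcasts over the leading batch dimension.

```python
import torch

batch, seq_len, emb_dim = 4, 5, 8

x = torch.randn(batch, seq_len, emb_dim)     # a 3D (higher-order) tensor, not a matrix
W = torch.randn(emb_dim, emb_dim)

# One call handles all 4 sequences at once: the matrix multiply is applied to
# each (seq_len, emb_dim) slice along the leading batch dimension.
out = x @ W
print(out.shape)                             # torch.Size([4, 5, 8])

# Batched attention scores: matmul also pairs up slices along the batch dimension.
scores = x @ x.transpose(1, 2)
print(scores.shape)                          # torch.Size([4, 5, 5])
```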
Why dropout is kind of like the mandatory vacation policies financial firms have| Giles' Blog
I've added a Markdown version of the site for AIs to read.| Giles' Blog
Adding dropout to the LLM's training is pretty simple, though it does raise one interesting question| Giles' Blog
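For reference, the mechanics really are simple (a sketch of my own; the interesting question itself is in the post): dropout zeroes a random subset of values during training and rescales the survivors, and does nothing at inference time.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()                 # training mode: roughly half the values are zeroed at
print(drop(x))               # random, the rest scaled up by 1/(1-p) = 2 so the
                             # expected sum stays the same

drop.eval()                  # inference mode: dropout is a no-op
print(drop(x))               # tensor of ones, untouched
```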
Causal, or masked self-attention: when we're considering a token, we don't pay attention to later ones. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 9/??| Giles' Blog
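The core trick looks something like this (a sketch of the standard approach, not the book's exact class): set every attention score that points at a later token to minus infinity before the softmax, so those positions get zero weight.

```python
import torch

seq_len = 4
attn_scores = torch.randn(seq_len, seq_len)   # a score for every (query, key) pair

# True above the diagonal = "key comes after query" = not allowed to attend.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

masked = attn_scores.masked_fill(mask, float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(weights)   # upper triangle is exactly 0; each row still sums to 1
```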
Moving on from a toy self-attention mechanism, it's time to find out how to build a real trainable one. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 8/??| Giles' Blog
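In outline (a sketch of scaled dot-product self-attention with learnable weights, assuming PyTorch; not the book's class): each input embedding is projected into a query, key and value, the scaled query/key dot products become attention weights, and the output is the weighted sum of values.

```python
import torch
import torch.nn as nn

class SimpleSelfAttention(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        # The trainable part: three projection matrices.
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):                        # x: (seq_len, d_in)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.T / k.shape[-1] ** 0.5    # scaled dot products
        weights = torch.softmax(scores, dim=-1)  # each row sums to 1
        return weights @ v                       # context vectors: weighted sums of values

x = torch.randn(5, 8)                            # 5 token embeddings of dimension 8
print(SimpleSelfAttention(8, 4)(x).shape)        # torch.Size([5, 4])
```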
Although it might seem that AI will make it pointless, I still think it's worth blogging.| Giles' Blog
Learning how to optimise self-attention calculations in LLMs using matrix multiplication. A deep dive into the basic linear algebra behind attention scores and token embeddings. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 7/??| Giles' Blog
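The optimisation in question boils down to something like this (my sketch of the chapter's simplified, weight-free attention scores): instead of looping over token pairs and taking dot products one at a time, a single matrix multiplication computes every pairwise score at once.

```python
import torch

inputs = torch.randn(6, 3)          # 6 token embeddings, 3 dimensions each

# Loop version: one dot product per (query, key) pair.
slow = torch.empty(6, 6)
for i, q in enumerate(inputs):
    for j, k in enumerate(inputs):
        slow[i, j] = torch.dot(q, k)

# Matrix-multiplication version: all 36 scores in one go.
fast = inputs @ inputs.T

print(torch.allclose(slow, fast))   # True
```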
Pondering how display technology's evolution from CRTs to modern screens influenced web design and dark mode, plus details of this site's new retro-inspired redesign.| Giles' Blog
How this blog now supports mathematical notation using MathML, enabling clean rendering of equations and matrices without JavaScript dependencies.| Giles' Blog
The essential matrix operations needed for neural networks. For ML beginners.| Giles' Blog
How we actually do matrix operations for neural networks in frameworks like PyTorch. For ML beginners.| Giles' Blog
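A few of the basics those two posts cover, as a quick taster (my examples, assuming PyTorch; not the posts' own):

```python
import torch

A = torch.tensor([[1., 2.], [3., 4.]])       # a 2x2 matrix
B = torch.tensor([[5., 6.], [7., 8.]])

print(A @ B)          # matrix multiplication: rows of A dotted with columns of B
print(A * B)          # element-wise multiplication -- a different operation!
print(A.T)            # transpose
print(A @ torch.tensor([1., 1.]))   # matrix-vector product: tensor([3., 7.])
```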
I went through my blog archives and learned some lessons about blogging.| Giles' Blog
Learning in public helps me grow as an engineer and seems to benefit others too. Here's why I should do more.| Giles' Blog