Archive of Giles Thomas’s blog posts from August 2025. Insights on AI, startups, and software development, plus occasional personal reflections.| www.gilesthomas.com
I'm getting towards the end of chapter 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". When I first read this chapter, it seemed to be about tricks to make LLMs trainable, but having gone through it more closely, only the first part -- on layer normalisation -- seems to fit into that category. The second, about the feed-forward network, definitely does not -- that's the part of the LLM that does a huge chunk of the thinking needed for next-token prediction. An...| Giles' blog
I've now finished chapter 4 in Sebastian Raschka's book "Build a Large Language Model (from Scratch)", having worked through shortcut connections in my last post. The remainder of the chapter doesn't introduce any new concepts -- instead, it shows how to put all of the code we've worked through so far into a full GPT-type LLM. You can see my code here, in the file gpt.py -- though I strongly recommend that if you're also working through the book, you type it in yourself -- I found that even t...| Giles' blog
How AI chatbots like ChatGPT work under the hood -- the post I wish I’d found before starting 'Build a Large Language Model (from Scratch)'.| Giles' Blog
I'm still working through chapter 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This chapter not only puts together the pieces that the previous ones covered, but adds on a few extra steps. I'd previously been thinking of these steps as just useful engineering techniques ("folding, spindling and mutilating" the context vectors) to take a model that would work in theory, but not in practice, and make it something trainable and usable -- but in this post I'll expl...| Giles' blog
The feed-forward network in an LLM processes context vectors one at a time. This feels like it would cause similar issues to the old fixed-length bottleneck, even though it almost certainly does not.| Giles' Blog
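As an illustration of the "one at a time" point, here is a minimal sketch (my own, not code from the post) of a position-wise feed-forward module in PyTorch, assuming the usual GPT-style 4x expansion of the embedding dimension:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: the same weights are applied to each context vector independently."""
    def __init__(self, emb_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),  # expand
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),  # project back down
        )

    def forward(self, x):
        # x has shape (batch, seq_len, emb_dim); each position is transformed
        # on its own, with no mixing between tokens in this layer.
        return self.layers(x)
```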
Working through layer normalisation -- why do we do it, how does it work, and why doesn't it break everything?| Giles' Blog
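For a rough idea of the operation involved, here is a hand-rolled sketch of layer normalisation (the learnable scale and shift parameters that a real implementation adds are omitted here):

```python
import torch

def layer_norm(x, eps=1e-5):
    # Normalise each context vector to zero mean and unit variance
    # across its embedding dimension (the last axis).
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(2, 4, 8)                 # (batch, tokens, embedding dim)
out = layer_norm(x)
print(out.mean(dim=-1))                   # ~0 for every token
print(out.var(dim=-1, unbiased=False))    # ~1 for every token
```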
Giles Thomas's blog: Practical insights on AI, startups, and software development, drawn from 30 years of building technology and 20 years of blogging.| www.gilesthomas.com
A couple of lessons learned in moving from Fabric3 to Fabric| Giles' Blog
Posts in the 'TIL deep dives' category on Giles Thomas’s blog. Insights on AI, startups, software development, and technical projects, drawn from 30 years of experience.| www.gilesthomas.com
Right now, starting a debugging session using AI before googling can leave you stuck, especially with newer technologies| Giles' Blog
Archive of Giles Thomas’s blog posts from June 2025. Insights on AI, startups, and software development, plus occasional personal reflections.| www.gilesthomas.com
Having worked through chapter 3 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and spent some time digesting the concepts it introduced (most recently in my post on the complexity of self-attention at scale), it's time for chapter 4. I've read it through in its entirety, and rather than working through it section-by-section in order, like I did with the last one, I think I'm going to jump around a bit, covering each new concept and how I wrapped my head around it s...| Giles' blog
After 14 years, it's time for me to move on -- but PythonAnywhere is in great hands and has a fantastic future ahead of it!| Giles' Blog
A pause to take stock: starting to build intuition on how self-attention scales (and why the simple version doesn't)| Giles' Blog
A pause to take stock: realising that attention heads are simpler than I thought explained why we do the calculations we do.| Giles' Blog
Finally getting to the end of chapter 3 of Raschka’s LLM book! This time it’s multi-head attention: what it is, how it works, and why the code does what it does.| Giles' Blog
Posts in the 'LLM from scratch' category on Giles Thomas’s blog. Insights on AI, startups, software development, and technical projects, drawn from 30 years of experience.| www.gilesthomas.com
Archive of Giles Thomas’s blog posts from April 2025. Insights on AI, startups, and software development, plus occasional personal reflections.| www.gilesthomas.com
Should RSS feeds contain the full blog post, or just the introductory paragraphs?| Giles' Blog
After a straw poll, I'm putting the full blog post into the RSS feed| Giles' Blog
Batching speeds up training and inference, but for LLMs we can't just use matrices for it -- we need higher-order tensors.| Giles' Blog
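A quick sketch of why the extra dimension is needed (the shapes below are illustrative, not taken from the post): one example is a matrix of token embeddings, and a batch stacks several of those matrices into a 3D tensor.

```python
import torch

# One training example: (seq_len, emb_dim). A batch of them: (batch, seq_len, emb_dim).
batch_size, seq_len, emb_dim = 8, 4, 3
batch = torch.randn(batch_size, seq_len, emb_dim)

# Attention scores need queries @ keys^T per example; with a 3D tensor the
# @ operator multiplies the trailing two dimensions of each example independently.
queries = keys = batch
scores = queries @ keys.transpose(1, 2)   # shape (8, 4, 4)
print(scores.shape)
```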
Posts in the 'Musings' category on Giles Thomas’s blog. Insights on AI, startups, software development, and technical projects, drawn from 30 years of experience.| www.gilesthomas.com
Posts in the 'AI' category on Giles Thomas’s blog. Insights on AI, startups, software development, and technical projects, drawn from 30 years of experience.| www.gilesthomas.com
Why dropout is kind of like the mandatory vacation policies financial firms have| Giles' Blog
I've added a Markdown version of the site for AIs to read.| Giles' Blog
Adding dropout to the LLM's training is pretty simple, though it does raise one interesting question| Giles' Blog
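As a small illustrative example (not the post's code), applying PyTorch's built-in dropout to a stand-in for an attention-weight matrix shows the two effects involved: zeroing and rescaling.

```python
import torch

torch.manual_seed(123)
dropout = torch.nn.Dropout(p=0.5)

attn_weights = torch.ones(6, 6)    # stand-in for attention weights
print(dropout(attn_weights))
# Roughly half the entries are zeroed, and the survivors are scaled by
# 1 / (1 - p) = 2 so the expected total stays the same. In eval mode
# (model.eval()), dropout becomes a no-op.
```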
Causal, or masked self-attention: when we're considering a token, we don't pay attention to later ones. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 9/??| Giles' Blog
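A minimal sketch of the masking trick (my own toy shapes, not the book's exact code): scores above the diagonal are set to minus infinity before the softmax, so later tokens get zero attention weight.

```python
import torch

seq_len = 6
scores = torch.randn(seq_len, seq_len)   # unmasked attention scores

# Mask everything above the diagonal so token i only attends to tokens 0..i.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = scores.masked_fill(mask, float("-inf"))

# Softmax turns the -inf entries into zero attention weights.
attn_weights = torch.softmax(masked_scores, dim=-1)
print(attn_weights)
```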
Archive of Giles Thomas’s blog posts from March 2025. Insights on AI, startups, and software development, plus occasional personal reflections.| www.gilesthomas.com
Moving on from a toy self-attention mechanism, it's time to find out how to build a real trainable one. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 8/??| Giles' Blog
Although it might seem that AI will make it pointless, I still think it's worth blogging.| Giles' Blog
Learning how to optimise self-attention calculations in LLMs using matrix multiplication. A deep dive into the basic linear algebra behind attention scores and token embeddings. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 7/??| Giles' Blog
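To show the flavour of the optimisation, here is a toy sketch (made-up numbers, simplified attention without trainable weight matrices): one matrix multiplication replaces a Python loop over all token pairs.

```python
import torch

# Toy token embeddings: 4 tokens, each a 3-dimensional vector.
inputs = torch.tensor([
    [0.1, 0.8, 0.3],
    [0.5, 0.2, 0.9],
    [0.7, 0.7, 0.1],
    [0.2, 0.4, 0.6],
])

# One matmul gives every token-to-token attention score at once.
attn_scores = inputs @ inputs.T            # shape (4, 4)
attn_weights = torch.softmax(attn_scores, dim=-1)

# Context vectors: each row is a weighted sum of all the input vectors.
context = attn_weights @ inputs            # shape (4, 3)
print(context)
```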
Pondering how display technology's evolution from CRTs to modern screens influenced web design and dark mode, plus details of this site's new retro-inspired redesign.| Giles' Blog
How this blog now supports mathematical notation using MathML, enabling clean rendering of equations and matrices without JavaScript dependencies.| Giles' Blog
The essential matrix operations needed for neural networks. For ML beginners.| Giles' Blog
How we actually do matrix operations for neural networks in frameworks like PyTorch. For ML beginners.| Giles' Blog
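For instance, a single dense layer boils down to one matrix multiplication plus a broadcast add -- a minimal sketch with made-up numbers, not an excerpt from the post:

```python
import torch

# A tiny "layer": 2 inputs -> 3 outputs, as a weight matrix plus a bias vector.
W = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # shape (2, 3)
b = torch.tensor([0.5, 0.5, 0.5])

x = torch.tensor([[1.0, 2.0]])        # one input row vector, shape (1, 2)
out = x @ W + b                       # matrix multiply, then broadcast add
print(out)                            # tensor([[ 9.5, 12.5, 15.5]])
```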
I went through my blog archives and learned some lessons about blogging.| Giles' Blog
Archive of Giles Thomas’s blog posts from February 2025. Insights on AI, startups, and software development, plus occasional personal reflections.| www.gilesthomas.com
Learning in public helps me grow as an engineer and seems to benefit others too. Here's why I should do more.| Giles' Blog