Working through layer normalisation -- why do we do it, how does it work, and why doesn't it break everything? | Giles' Blog
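A minimal sketch of the layer normalisation idea that post works through, assuming PyTorch; the tensor shape, `eps` value, and the `scale`/`shift` parameters here are illustrative placeholders, not taken from the post itself.

```python
import torch

torch.manual_seed(123)

# A toy batch of 2 "token" vectors with 5 dimensions each (hypothetical shape).
x = torch.randn(2, 5)

# Normalise each row to zero mean and unit variance...
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
eps = 1e-5  # keeps the division stable when the variance is tiny
x_norm = (x - mean) / torch.sqrt(var + eps)

# ...then apply a learnable scale and shift, so the network can undo the
# normalisation later if that turns out to be useful.
scale = torch.ones(5)
shift = torch.zeros(5)
out = scale * x_norm + shift

print(out.mean(dim=-1))                  # ~0 for each row
print(out.var(dim=-1, unbiased=False))   # ~1 for each row
```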
A pause to take stock: starting to build intuition on how self-attention scales (and why the simple version doesn't) | Giles' Blog
A pause to take stock: realising that attention heads are simpler than I thought, which explains why we do the calculations we do. | Giles' Blog
Batching speeds up training and inference, but for LLMs we can't just use matrices for it -- we need higher-order tensors. | Giles' Blog
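A minimal sketch of why a batch of sequences needs a higher-order tensor rather than a single matrix, assuming PyTorch; the batch size, sequence length, and embedding dimension are made-up numbers for illustration.

```python
import torch

batch_size, seq_len, d_embed = 2, 4, 3

# One sequence is a (seq_len, d_embed) matrix; a batch of them stacks into
# a rank-3 tensor of shape (batch_size, seq_len, d_embed).
batch = torch.randn(batch_size, seq_len, d_embed)

# Attention scores per sequence: @ broadcasts the matrix multiply over the
# leading batch dimension, giving (batch_size, seq_len, seq_len).
scores = batch @ batch.transpose(1, 2)
print(scores.shape)  # torch.Size([2, 4, 4])
```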
Causal, or masked, self-attention: when we're considering a token, we don't pay attention to later ones. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 9/?? | Giles' Blog
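A minimal sketch of that causal masking step, assuming PyTorch and random placeholder embeddings rather than anything from the post: scores for "later" tokens are set to minus infinity so the softmax gives them zero weight.

```python
import torch

torch.manual_seed(123)
seq_len, d_embed = 4, 3
x = torch.randn(seq_len, d_embed)  # placeholder token embeddings

# Raw attention scores between every pair of tokens.
scores = x @ x.T / d_embed**0.5

# Mask out positions above the diagonal: token i must not attend to
# tokens j > i (the "later" tokens).
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

# Softmax turns the -inf entries into zero attention weights.
weights = torch.softmax(scores, dim=-1)
print(weights)  # lower-triangular; each row sums to 1
```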
Although it might seem that AI will make it pointless, I still think it's worth blogging. | Giles' Blog
Learning how to optimise self-attention calculations in LLMs using matrix multiplication. A deep dive into the basic linear algebra behind attention scores and token embeddings. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 7/?? | Giles' Blog
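A minimal sketch of that optimisation, assuming PyTorch and simplified, untrained token embeddings: the pairwise dot products that make up the attention scores all fall out of a single matrix multiplication instead of a loop.

```python
import torch

torch.manual_seed(123)
seq_len, d_embed = 6, 4
embeddings = torch.randn(seq_len, d_embed)  # one row per token

# Looping over token pairs and taking dot products one at a time works,
# but the same result comes from one matmul.
scores = embeddings @ embeddings.T  # shape (seq_len, seq_len)

# Entry [i, j] is the dot product of token i's embedding with token j's.
print(scores.shape)
print(torch.allclose(scores[1, 2], torch.dot(embeddings[1], embeddings[2])))
```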