Working through layer normalisation -- why do we do it, how does it work, and why doesn't it break everything?
A pause to take stock: starting to build intuition on how self-attention scales (and why the simple version doesn't).
Batching speeds up training and inference, but for LLMs we can't just use matrices for it -- we need higher-order tensors.
Adding dropout to the LLM's training is pretty simple, though it does raise one interesting question.
Moving on from a toy self-attention mechanism, it's time to find out how to build a real trainable one. Following Sebastian Raschka's book 'Build a Large Language Model (from Scratch)'. Part 8/??