Working through layer normalisation -- why do we do it, how does it work, and why doesn't it break everything? | Giles' Blog