1 Preliminaries

| Symbol | Meaning |
|---|---|
| $d$ | width of the residual stream |
| $L$ | number of Transformer blocks |
| $V$ | vocabulary size, so logits live in $\mathbb R^{V}$ |
| $h^{(\ell)}$ | residual-stream vector entering block $\ell$ |
| $r^{(\ell)}$ | the update written by block $\ell$ |
| $W_U\!\in\!\mathbb R^{V\times d},\;b\in\mathbb R^{V}$ | un-embedding matrix and bias |

Additive residual stream. With (pre-/peri-norm) residual connections,

$$
h^{(\ell+1)} \;=\; h^{(\ell)} + r^{(\ell)},\qquad \ell=0,\dots,L-1.
$$

Hence the ...