I was recently re-reading Finbarr Timbers's post on transformer inference optimizations, and I wanted to implement each of these techniques in nanoGPT to see how much we could practically speed up inference with GPT-2's architecture and reduce the model's computational bottlenecks.