## Introduction

Flash Attention [1] is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. It is widely used in LLM inference and training, and is the default attention backend in modern serving engines such as SGLang and vLLM.

## Naive Attention Calculation

Before we figure out how Flash Attention works, let's first take a look at the naive attention calculation:

\[\begin{align}
\text{attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V
\end{align}\]
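To make this concrete, here is a minimal PyTorch sketch of the naive computation, assuming a single head with inputs of shape (seq_len, head_dim); the function name and shapes are illustrative, not taken from any particular library. Note that the full seq_len × seq_len score matrix is materialized, which is the memory traffic Flash Attention's tiling is designed to avoid.

```python
import torch

def naive_attention(q, k, v):
    """Naive attention: materializes the full (seq_len, seq_len) score matrix.

    q, k, v: tensors of shape (seq_len, head_dim). Shapes are assumptions
    for illustration; real implementations also carry batch and head dims.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5  # (seq_len, seq_len) scaled scores
    probs = torch.softmax(scores, dim=-1)      # row-wise softmax over keys
    return probs @ v                           # (seq_len, head_dim) output

# Example usage with assumed sizes.
q = torch.randn(128, 64)
k = torch.randn(128, 64)
v = torch.randn(128, 64)
out = naive_attention(q, k, v)
print(out.shape)  # torch.Size([128, 64])
```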