Attention, as a core layer of the ubiquitous Transformer architecture, is a bottleneck for large language models and long-context applications. FlashAttention (and FlashAttention-2) pioneered an approach to speed up attention on GPUs by minimizing memory reads/writes, and is now used by most libraries to accelerate Transformer training and inference. This has contributed to a massive increase in LLM context length in the last two years, from 2-4K (GPT-3, OPT) to 128K (GPT-4), or even 1M (Ll... | pytorch.org
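A rough way to see what "minimizing memory reads/writes" buys: the naive formulation materializes the full N×N score matrix, while an online-softmax tiling (the idea FlashAttention fuses into on-chip kernels) walks over keys and values block by block and never forms that matrix. The sketch below is an illustrative PyTorch-level rendition of that tiling, not the actual FlashAttention kernel; the function names and block size are invented for illustration.

```python
import torch

def naive_attention(q, k, v):
    # Reference formulation: materializes the full (N, N) score matrix,
    # which is exactly the memory traffic IO-aware kernels try to avoid.
    scores = (q @ k.T) * q.shape[-1] ** -0.5
    return torch.softmax(scores, dim=-1) @ v

def tiled_attention(q, k, v, block=128):
    # Illustrative online-softmax tiling: visit K/V in blocks, keeping a
    # running row-max and row-sum so the full score matrix never exists.
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((q.shape[0], 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros(q.shape[0], 1, dtype=q.dtype, device=q.device)
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                          # (N, block) partial scores
        new_max = torch.maximum(row_max, s.amax(dim=-1, keepdim=True))
        correction = torch.exp(row_max - new_max)       # rescale earlier partial results
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

# The tiled version matches the naive reference up to float32 rounding.
q, k, v = (torch.randn(256, 64) for _ in range(3))
assert torch.allclose(naive_attention(q, k, v), tiled_attention(q, k, v), atol=1e-4)
```

In the real kernels the per-block work happens in on-chip SRAM, so global-memory traffic scales with N rather than N².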
If we as an ecosystem hope to make progress, we need to understand how the CUDA software empire became so dominant. | www.modular.com
Part 1 of an article that explores the future of hardware acceleration for AI beyond CUDA, framed in the context of the release of DeepSeek. | www.modular.com
1.1. Scalable Data-Parallel Computing using GPUs | docs.nvidia.com