Matrix Multiplication on Blackwell: Part 3 - The Optimizations Behind 85% of SOTA Performance | Modular Blog
Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul | Modular Blog
Matrix Multiplication on Blackwell: Part 1 - Introduction | Modular Blog
In this blog post, we’ll continue our journey to build a state-of-the-art (SOTA) matmul kernel on NVIDIA Blackwell by exploring the cluster launch control (CLC) optimization. By the end of the post, we’ll have improved our performance by another 15%, reaching 1772 TFLOPs and exceeding the current SOTA. | www.modular.com