Porting a CUDA Fast Fourier Transform (FFT) implementation to Mojo for the LeetGPU Fast Fourier Transform challenge presented an unexpected challenge: achieving bit-exact precision matching between CUDA's sinf()/cosf() functions and their Mojo equivalents. This required PTX assembly analysis, cross-platform testing, and ultimately upgrading to Float64 precision for deterministic results. Challenge constraints: N range $1 \leq N \leq 262{,}144$ (power-of-2 FFT sizes); data type: all values...
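The kind of discrepancy described in that post is easy to reproduce with a small standalone kernel. The sketch below is a hypothetical illustration, not code from the post: it computes FFT twiddle factors once with single-precision cosf() and once by evaluating the angle and cosine in Float64 before narrowing to Float32, then reports the largest difference. The kernel name, grid configuration, and error metric are assumptions made for this example.

```cuda
// Hypothetical sketch (not from the post): reproduce the twiddle-factor
// precision gap between single-precision cosf() and a Float64 reference
// that is rounded to Float32 only at the end.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void twiddle_compare(int n, float* err) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;

    const double PI = 3.141592653589793;

    // Single-precision path, as a straight CUDA port would compute it.
    float angle_f = -2.0f * (float)PI * (float)k / (float)n;
    float cos_f   = cosf(angle_f);

    // Double-precision path: compute the angle and cosine in Float64,
    // then round once to Float32 for a deterministic result.
    double angle_d = -2.0 * PI * (double)k / (double)n;
    float  cos_d   = (float)cos(angle_d);

    err[k] = fabsf(cos_f - cos_d);
}

int main() {
    const int n = 262144;  // largest N allowed by the challenge
    float* err = nullptr;
    cudaMallocManaged(&err, n * sizeof(float));

    twiddle_compare<<<(n + 255) / 256, 256>>>(n, err);
    cudaDeviceSynchronize();

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i) max_err = fmaxf(max_err, err[i]);
    printf("max |cosf - (float)cos| over %d twiddle factors: %g\n", n, max_err);

    cudaFree(err);
    return 0;
}
```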
Technical analysis of the AMD GPU support implementation in Triton's Gluon framework, including architecture-specific optimizations and performance characteristics.
Analysis of RustBPE, a Rust implementation of BPE tokenizer training with parallel processing and performance optimizations over Python implementations.
Learning GPU performance engineering through the GPU MODE TriMul challenge: achieving a 2.42× speedup on H100 through FP16 optimization, weight fusion, and systematic experimentation.
GPU production constraints are creating infrastructure bottlenecks, and multi-GPU programming, particularly vendor-agnostic implementations, has become essential. In their GPU Mode presentation, AMD Research engineers Muhammad Awad, Muhammad Osama, and Brandon Potter introduced Iris, a Python library that enables fine-grained multi-GPU programming in Triton. As with my previous Gluon blog post, this post captures my understanding and interpretation of their work, serving as both technical do...