Recently I tweeted about the realistic Speed-of-Light (SOL) of the 5090 and RTX PRO 6000 for some dtypes, and mobicham asked me about FP8 MMA with FP16 accumulation. The me of last year would have turned to Triton for this - it's trivial to change the accumulation dtype of tl.dot(). However, I now roughly know how to write a fast matmul kernel, so why not do it myself! In addition, I have been tinkering with torch.cuda._compile_kernel(), which compiles CUDA kernels super fast via NVRTC. This seems ideal... For reference, the Triton route would look something like the sketch below.
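
A minimal sketch (not the kernel developed in this post) of what "change the accumulation dtype of tl.dot()" means: a plain Triton matmul loop with FP8 inputs where the accumulator and `out_dtype` are set to FP16 instead of the default FP32. Block sizes, layouts, and pointer arithmetic here are illustrative assumptions.

```python
import triton
import triton.language as tl

@triton.jit
def fp8_matmul_fp16_acc(a_ptr, b_ptr, c_ptr, M, N, K,
                        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # FP16 accumulator instead of the usual FP32 one
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float16)
    for k in range(0, K, BLOCK_K):
        # assume row-major A (M, K) and B (K, N), both stored as FP8 (e.g. e4m3)
        a = tl.load(a_ptr + offs_m[:, None] * K + (k + offs_k)[None, :])
        b = tl.load(b_ptr + (k + offs_k)[:, None] * N + offs_n[None, :])
        # out_dtype=tl.float16 asks for FP16 accumulation in the MMA
        acc = tl.dot(a, b, acc, out_dtype=tl.float16)

    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
```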