Learning CUDA by optimizing matrix-vector multiplication (SGEMV) for cuBLAS-like performance | Maharshi's blog
Learning CUDA by optimizing softmax that beats PyTorch | Maharshi's blog