TMA (Tensor Memory Accelerator) is essential to achieve peak bandwidth on GPUs. In the past I wrote a blogpost on in pure CUDA. It might help to read t...| simons blog
In this blogpost I want to show how to implement a highly efficient matrix transpose operation for Hopper GPUs. I will use native CUDA APIs without abstract...| simons blog