Swizzling is a critical technique used in state-of-the-art GEMM kernels to avoid costly shared memory bank conflicts. For a quick refresher on what shared memory bank conflicts are and why they matter for performance, I recommend the excellent explanation on simons blog.
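To make the idea concrete, here is a minimal sketch (not taken from the linked post) of the classic XOR swizzle applied to a shared memory tile. The kernel name, `TILE_DIM`, and the transpose use case are illustrative assumptions; the point is only that XOR-ing the column index with the row index permutes columns within each row so that both the row-wise write and the column-wise read touch 32 distinct banks.

```cuda
// Minimal sketch, assuming a 32x32 float tile: XOR swizzle on the shared
// memory column index to avoid bank conflicts on the transposed read.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_DIM 32

__global__ void swizzled_transpose_tile(const float* in, float* out) {
    __shared__ float smem[TILE_DIM][TILE_DIM];

    int row = threadIdx.y;
    int col = threadIdx.x;

    // Swizzled column: logical column c of row r is stored at column (c ^ r),
    // so for a fixed row the 32 lanes of a warp still hit 32 distinct banks.
    int swz_col = col ^ row;
    smem[row][swz_col] = in[row * TILE_DIM + col];
    __syncthreads();

    // Transposed read: logical element (col, row) was stored at swizzled
    // column (row ^ col). Bank index is (row ^ col) % 32, a permutation over
    // the warp, so this read is also conflict free. Without the swizzle the
    // column-wise access would be a 32-way bank conflict.
    out[row * TILE_DIM + col] = smem[col][row ^ col];
}

int main() {
    float *in, *out;
    cudaMallocManaged(&in,  TILE_DIM * TILE_DIM * sizeof(float));
    cudaMallocManaged(&out, TILE_DIM * TILE_DIM * sizeof(float));
    for (int i = 0; i < TILE_DIM * TILE_DIM; ++i) in[i] = static_cast<float>(i);

    swizzled_transpose_tile<<<1, dim3(TILE_DIM, TILE_DIM)>>>(in, out);
    cudaDeviceSynchronize();

    // Spot check: out[r][c] should equal in[c][r].
    printf("out[1][2] = %.0f (expected %.0f)\n",
           out[1 * TILE_DIM + 2], in[2 * TILE_DIM + 1]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Production GEMM kernels typically use hardware-defined swizzle patterns (for example the swizzle modes understood by TMA and the MMA instructions), but the underlying principle is the same XOR permutation shown above.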