In this post, I’ll iteratively optimize an implementation of matrix multiplication written in CUDA. My goal is not to build a cuBLAS replacement, but to deepl...| siboehm.com
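For context, a walkthrough like that typically starts from a naive one-thread-per-output-element kernel and optimizes from there. Below is a minimal sketch of such a baseline; the `sgemm_naive` name, the SGEMM-style `alpha`/`beta` interface, and the row-major layout are assumptions for illustration, not code quoted from the post.

```cuda
#include <cuda_runtime.h>

// Naive baseline: C = alpha * A * B + beta * C for row-major matrices,
// where A is MxK, B is KxN, and C is MxN. One thread per element of C.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
    const int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    const int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C

    if (row < M && col < N) {
        float acc = 0.0f;
        // Each thread walks one row of A and one column of B.
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
```

Every global-memory access here is uncached and uncoalesced work per thread, which is exactly the kind of baseline that leaves most of the GPU's bandwidth and compute on the table and motivates the later optimization steps.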
The guide to building CUDA applications for GPUs based on the NVIDIA Ampere GPU Architecture.| docs.nvidia.com
Making Deep Learning Go Brrrr From First Principles| horace.io
GPUs accelerate machine learning operations by performing calculations in parallel. Many operations, especially those representable as matrix multiplications, will see good acceleration right out of the box. Even better performance can be achieved by tweaking operation parameters to efficiently use GPU resources. The performance documents present the tips that we think are most widely useful.| NVIDIA Docs
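As one concrete, assumed example of that kind of parameter tweaking: launching a kernel with a thread-block shape that is a multiple of the 32-thread warp size, so no warp runs partially filled. The helper below reuses the `sgemm_naive` sketch above; the 32x8 block shape and the `launch_sgemm_naive` name are illustrative choices, not recommendations taken from the NVIDIA docs.

```cuda
#include <cuda_runtime.h>

// Launch helper: pick a block shape whose thread count is a multiple of the
// 32-thread warp size (32x8 = 256 threads) and size the grid to cover C.
// dA, dB, dC are device pointers; sgemm_naive is the kernel sketched above.
void launch_sgemm_naive(int M, int N, int K, float alpha,
                        const float *dA, const float *dB,
                        float beta, float *dC) {
    dim3 block(32, 8);
    dim3 grid((N + block.x - 1) / block.x,   // blocks across the columns of C
              (M + block.y - 1) / block.y);  // blocks down the rows of C
    sgemm_naive<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);
}
```

The same idea extends to the other knobs the performance guides discuss, such as choosing tile and matrix dimensions that keep the hardware's execution units fully occupied.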