Quantum computing promises to reshape industries — but progress hinges on solving key problems. Error correction. Simulations of qubit designs. Circuit compilation optimization tasks. These are among the bottlenecks that must be overcome to bring quantum hardware into the era of useful applications. Enter accelerated computing, whose parallel processing offers the power …| NVIDIA Blog
CUDA PTX ldmatrix Instruction and Its CuTe Wrapper| Lei Mao's Log Book
F5 has announced its intent to acquire enterprise AI security company CalypsoAI, whose award-winning platform brings real-time threat defense, red teaming at scale, and data security to enterprises racing to deploy generative and agentic AI.| ITOps Times
In this post, I’ll iteratively optimize an implementation of matrix multiplication written in CUDA. My goal is not to build a cuBLAS replacement, but to deepl…| siboehm.com
CuTe DSL Basics — From Hello to Tiled Kernels| Chris Choy
Designing Tilers for Data Access| Lei Mao's Log Book
Posted by the TensorFlow team. TensorFlow 2.18 has been released! Highlights of this release (and 2.17) include NumPy 2.0, LiteRT repository, CUDA Update, Hermetic CUDA and more. For the full release notes, please click here.| The TensorFlow Blog
Posted by the TensorFlow team. TensorFlow 2.17 has been released! Highlights of this release (and 2.16) include CUDA update, upcoming NumPy 2.0, and more. For the full release notes, please click here.| The TensorFlow Blog
Essential Constants for Numerical Algorithms and Scientific Computations| Lei Mao's Log Book
Rust CUDA enables you to write and run CUDA kernels in Rust| Rust GPU Blog
Deriving Inverse Layout Mathematically| Lei Mao's Log Book
Creating Tiled Layouts Using Blocked Product and Raked Product| Lei Mao's Log Book
Elucidating CuTe Inner Partition and Local Tile| Lei Mao's Log Book
Elucidating CuTe Outer Partition and Local Partition| Lei Mao's Log Book
CUDA Memory Load/Store Performance: A Comprehensive Benchmark Analysis| Chris Choy
Avoiding CUDA Shared Memory Bank Conflicts| Lei Mao's Log Book
In the latest round of MLPerf Training, the NVIDIA AI platform delivered the highest performance at scale on every benchmark.| NVIDIA Blog
Brought to you by the ITS Research team at QMUL| blog.hpc.qmul.ac.uk
I am somehow very late to learning CUDA. I didn’t even know until recently that CUDA is just C++ with a small amount of extra stuff. If I had known that there is so little friction to learning it, I would have checked it out much earlier. But if you come in with C++ habits, […]| Probably Dance
The Rust CUDA project has been rebooted after …| Rust GPU Blog
We're excited to announce the reboot of the Rust CUDA project. Rust CUDA enables you to write and run CUDA kernels in Rust, executing directly on NVIDIA GPUs using NVVM IR.| rust-gpu.github.io
Learning CUDA by optimizing matrix-vector multiplication (SGEMV) for cuBLAS-like performance| Maharshi's blog
Unification of Memory on the Grace Hopper Nodes: The delivery of new GPUs for research is continuing; most notable is the new Isambard-AI cluster at Bristol. As new cutting-edge GPUs are released, software engineers are tasked with learning the new architectures and features these GPUs offer. The new Grace-Hopper GH200 nodes, as announced in a previous blog post, consist of a 72-core NVIDIA Grace CPU and an H100 Tensor Core GPU. One of the key innovations is the NVIDIA NVLink Chip-2-Chip (C2C) …| QMUL ITS Research Blog
Fixing the pytorch unknown CUDA error.| The Cloistered Monkey
NAMD is a molecular dynamics program that can use GPU acceleration to speed up its calculations. Recent OpenPOWER machines like the IBM Power Systems S822LC for High Performance Computing (Minsky) come with a new interconnect for GPUs called NVLink, which offers extremely high bandwidth to a number of very powerful Nvidia Pascal P100 GPUs. So they're ideal machines for this sort of workload.| sthbrx.github.io
Our previous blogs (Taichi & PyTorch 01 and 02) pointed out that Taichi and Torch serve different application scenarios. Can they complement each other? The answer is an unequivocal yes! In this blog, we will use two simple examples to explain how to use a Taichi kernel to implement data preprocessing operators or custom ML operators. With Taichi, you can accelerate your ML model development with ease and get rid of tedious low-level parallel programming (CUDA, for example) for good.| docs.taichi-lang.org