TL;DR: Most C++ and Rust thread‑pool libraries leave significant performance on the table, often running 10× slower than OpenMP on classic fork‑join workloads and micro-benchmarks. So I’ve drafted a minimal ~300‑line library called Fork Union that lands within 20% of OpenMP. It does not use advanced NUMA tricks; it uses only the C++ and Rust standard libraries and has no other dependencies. OpenMP has been the industry workhorse for coarse‑grain parallelism in C and C++ for de...| Ash's Blog
You’ve probably seen a CUDA tutorial like this one, a classic “Hello World” blending CPU and GPU code in a single “heterogeneous” CUDA C++ source file, with the kernel launched using NVCC’s now-iconic triple-bracket <<<>>> syntax: #include #include __global__ void kernel() { printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x); } int main() { kernel<<<1, 1>>>(); // Returns `void`?! 🤬 return cudaDeviceSynchronize() == cudaSuccess...| Ash's Blog
The race for AI dominance isn’t just about who has the most computing power; it’s increasingly about who can use it most efficiently. With the recent emergence of DeepSeek and other competitors in the AI space, even well-funded companies are discovering that raw computational power isn’t enough. The ability to squeeze maximum performance out of hardware through low-level optimization is becoming a crucial differentiator. One powerful tool in this optimization arsenal is the ability to work d...| Ash's Blog
For High-Performance Computing engineers, here’s the gist: On Intel CPUs, the vaddps instruction (vectorized float addition) executes on ports 0 and 5. The vfmadd132ps instruction (vectorized fused float multiply-add, or FMA) also executes on ports 0 and 5. On AMD CPUs, however, the vaddps instruction takes ports 2 and 3, and the vfmadd132ps instruction takes ports 0 and 1. Since FMA is equivalent to simple addition when one of the arguments is 1, we can drastically increase the throughput ...| Ash's Blog
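The equivalence mentioned above, FMA with a multiplier of 1 behaving like plain addition, can be checked in scalar C++ before worrying about ports. This is a minimal sketch, not code from the post: the helper names `add_plain` and `add_via_fma` are hypothetical, and whether a compiler actually emits `vaddps` or `vfmadd132ps` for these loops depends on the target flags and optimization level.

```cpp
#include <cmath>    // std::fma
#include <cstddef>  // std::size_t

// Hypothetical helper: plain element-wise addition, out[i] = a[i] + b[i].
void add_plain(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Hypothetical helper: the same sum expressed as a fused multiply-add,
// out[i] = a[i] * 1 + b[i]. Multiplying by 1 is exact and the FMA rounds
// once, so for finite inputs the result matches the plain addition.
void add_via_fma(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::fma(a[i], 1.0f, b[i]);
}
```

Calling both helpers on the same inputs should yield identical outputs; the interesting part is only which execution ports each form occupies.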
It’s 2025. Sixteen years ago, someone asked on StackOverflow how to split a string in C++. With 3000 upvotes, you might think this question has been definitively answered. However, the provided solutions can be greatly improved in terms of both flexibility and performance, yielding up to a 10x speedup. In this post, we’ll explore three better ways to split strings in C++, including a solution I briefly mentioned in 2024 as part of a longer review of the Painful Pitfalls of C++ STL Strings.| Ash's Blog
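For readers who want a concrete starting point, here is one common copy-free way to split a string in C++17. It is only a sketch under my own assumptions and is not necessarily one of the three solutions from the post: the `split` helper is hypothetical, it treats every character in `delimiters` as a separator, and it keeps empty tokens.

```cpp
#include <cstddef>      // std::size_t
#include <string_view>  // std::string_view
#include <vector>       // std::vector

// Hypothetical helper: returns views into `text`, one per token.
// No characters are copied, so `text` must outlive the returned views.
std::vector<std::string_view> split(std::string_view text,
                                    std::string_view delimiters) {
    std::vector<std::string_view> tokens;
    std::size_t start = 0;
    while (start <= text.size()) {
        // Find the next delimiter, or treat the end of the text as one.
        std::size_t end = text.find_first_of(delimiters, start);
        if (end == std::string_view::npos) end = text.size();
        tokens.push_back(text.substr(start, end - start));
        start = end + 1;  // Skip past the delimiter.
    }
    return tokens;
}
```

With this sketch, `split("a,b;;c", ",;")` produces four tokens: `a`, `b`, an empty token, and `c`.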
This blogpost is a mirror of the original post on Modular.com. Modern CPUs have an incredible superpower: super-scalar operations, made available through single instruction, multiple data (SIMD) parallel processing. Instead of doing one operation at a time, a single core can do up to 4, 8, 16, or even 32 operations in parallel. In a way, a modern CPU is like a mini GPU, able to perform a lot of simultaneous calculations. Yet, because it’s so tricky to write parallel operations, almost all t...| Ash's Blog