Table of Contents 1 Introduction 1.1 Problem statement 1.2 Motivation 1.3 Recommended reading 1.4 Binary search and Eytzinger layout 1.5 Hugepages 1.6 A note on benchmarking 1.7 Cache lines 1.8 S-trees and B-trees 2 Optimizing find 2.1 Linear 2.2 Auto-vectorization 2.3 Trailing zeros 2.4 Popcount 2.5 Manual SIMD 3 Optimizing the search 3.1 Batching 3.2 Prefetching 3.3 Pointer arithmetic 3.3.1 Up-front splat 3.3.2 Byte-based pointers 3.3.3 The final version 3.4 Skip prefetch 3.5 Interleave 4 O...| CuriousCoding
📜 Intrinsics are like a recipe for the most optimized code that you can write to achieve certain behavior.| That One Game Dev
A couple of years ago I wrote about tolower() in bulk at speed using| dotat.at
Permanent link to this article – https://tonybai.com/2024/07/21/simd-in-go| tonybai.com
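Permanent link to this article - https://tonybai.com/2024/07/21/simd-in-go A while ago, some veteran Gophers, such as fasthttp author Aliaksandr Valialkin, complained that adding function iterators in Go 1.23 showed Go's evolution heading in the wrong direction: toward more complexity and implicit...| tonybai.com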
When our Python code is too slow, like most others we switch to C and often get 100x speed boosts, just like when we replaced SciPy distance computations with SimSIMD. But imagine going 100x faster than C code! It sounds crazy, especially for number-crunching tasks that are “data-parallel” and easy for compilers to optimize. In such spots the compiler will typically “unroll” the loop, vectorize the code, and use SIMD instructions to process multiple data elements in parallel.| ashvardanian.com
GPU acceleration can be trivial for Python users. Follow CUDA installation steps carefully, replace import numpy as np with import cupy as np, and you will often get the 100x performance boosts without breaking a sweat. Every time you write magical one-liners, remember a systems engineer is making your dreams come true. A couple of years ago, when I was giving a talk on the breadth of GPGPU technologies, I published a repo.| ashvardanian.com
This will be a story about many things: about computers, about their (memory) speed limits, about very specific workloads that can push computers to those limits and the subtle differences in Hash-Tables (HT) designs. But before we get in, here is a glimpse of what we are about to see. A friendly warning, the following article contains many technical terms and is intended for somewhat technical and hopefully curious readers.| ashvardanian.com
A single software company can spend over 💲10 Billion/year on data centres, but not every year is the same. When all stars align, we see bursts of new technologies reaching the market simultaneously, thus restarting the purchasing super-cycle. 2022 will be just that, so let’s jump a couple of quarters ahead and see what’s on the shopping list of your favorite hyperscaler! Friendly warning: this article is full of technical terms and jargon, so it may be hard to read if you don’t writ...| ashvardanian.com
There are only two kinds of languages: the ones people complain about and the ones nobody uses. – Bjarne Stroustrup, creator of C++. Very few consider C++ attractive, and only some people think it’s easy. Choosing it for a project generally means you care about the performance of your code. And rightly so! Today, machines can process hundreds of Gigabytes per second, and we, as developers, should all learn to saturate those capabilities.| ashvardanian.com
This section is a follow-up to the previous one, where we optimized binary search by removing branching and improving the memory layout. Here, we will also be searching in sorted arrays, but this time we are not limited to fetching and comparing only one element at a time.| en.algorithmica.org
Introduction and index of this series is here.| Aras' website