In my recent post on optimising zlib decompression for the Apple M1, I used a loop that refilled a bit-buffer and decoded a Huffman code each iteration, based on variant 4 from Fabian Giesen’…| dougallj
CRC32 is a checksum first proposed in 1961, and now used in a wide variety of performance-sensitive contexts, from file formats (zip, png, gzip) to filesystems (ext4, btrfs) and protocols (like eth…| dougallj
Out-of-order processors have to keep track of multiple in-flight operations at once, and they use a variety of different buffers and queues to do so. I’ve been trying to characterise and meas…| dougallj
In the words of Tom Lehrer, “this is completely pointless, but may prove useful to some of you some day, perhaps in a somewhat bizarre set of circumstances.” The problem is as follows: …| dougallj
Rosetta 2 is remarkably fast when compared to other x86-on-ARM emulators. I’ve spent a little time looking at how it works, out of idle curiosity, and found it to be quite unusual, so I figur…| dougallj
DEFLATE is a relatively slow compression algorithm from 1991, which (along with its wrapper format, zlib) is incredibly widely used, for example in the PNG, Zip and Gzip file formats and the HTTP, …| dougallj
Variable length non-self-synchronising prefix codes (like x86 instructions and Huffman codes) are hard to decode in parallel, as each word must be decoded to figure out its length, before the next …| dougallj
I was inspired by Daniel Lemire’s blog post, Converting integers to fix-digit representations quickly (and the follow up Converting integers to decimal strings faster with AVX-512) to try sol…| dougallj
I came up with a (seemingly) new method to encode bitmask immediate values on ARM64. This really isn’t worth optimising – clarity and verifiability are more important – but it…| dougallj
Many people, myself included, have held the belief that Spectre exploits need to know, understand, and manipulate microarchitectural details that are specific to a given processor design. Published…| dougallj