Ok, here's the story so far. Way back when, I wanted to test whether sort + FastTwoSum is faster than TwoSum. But as I started optimizing it, I kept getting frustrated with how hard it is to benchmark low-level code, and eventually I wrote a simulator for my CPU, the Apple M1 chip. At this point, I was in too deep and wanted to keep improving the simulator.

Compiling to Assembly

As of the last blog post, I'd added a little DAG-building library so that I could easily write little assembly snipp...