Chandler Carruth (I think - I can't for the life of me find the reference) said something in a CppCon talk years ago that blew my mind. More or less: 95% of code performance comes from the memory layout and memory access patterns of your data structures, and 5% comes from clever instruction selection and instruction stream optimization.
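To make the access-pattern point concrete, here's a minimal sketch of my own (not from the talk). The function names `sum_row_order` and `sum_column_order` are made up for illustration. Both do the exact same additions over the same flat, row-major buffer; the only difference is the order in which memory is touched.

```cpp
#include <cstddef>
#include <vector>

// Sum a rows x cols matrix stored row-major in one contiguous vector.
// The inner loop walks adjacent elements (stride 1), so consecutive
// reads tend to land in the same cache line.
long long sum_row_order(const std::vector<int>& m,
                        std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same arithmetic, same result, but the inner loop jumps `cols`
// elements per step, so consecutive reads are far apart in memory.
long long sum_column_order(const std::vector<int>& m,
                           std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

On a matrix much larger than cache, I'd expect the strided version to be noticeably slower on typical hardware even though the instruction mix is nearly identical, but how much slower depends on the machine and the compiler, so measure it yourself rather than trusting a number from me.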