Last year, the Character.AI team released a blog post detailing their approach to building a highly efficient inference system that serves over 20,000 inference queries per second, roughly 20% of Google Search's traffic. They focused on reducing the size of the KV cache, which speeds up decoding in transformers but whose memory footprint is the key bottleneck in inference. Specifically, they implemented three model architecture changes that reduced the KV cache size by more tha...
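
To see why the KV cache dominates, here is a rough back-of-the-envelope sketch (the model dimensions below are hypothetical, not Character.AI's): during decoding, each new token attends over the keys and values of all previous tokens, so every layer caches a K and a V tensor per token, and the cache grows linearly with sequence length and batch size.

```python
# Back-of-the-envelope KV cache sizing.
# All model parameters here are hypothetical, purely illustrative.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    # Each layer stores a K and a V tensor per token:
    # 2 (K and V) * n_kv_heads * head_dim elements, at bytes_per_elem each (fp16 = 2).
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

# Hypothetical mid-sized model: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
per_request = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                             seq_len=8192, batch_size=1)
print(f"{per_request / 2**30:.1f} GiB per 8K-token request")  # 4.0 GiB
```

At these (assumed) dimensions a single 8K-token conversation already needs about 4 GiB of cache, which is why shrinking the KV cache, rather than the model weights, is where the largest serving-cost wins come from.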