Last year, the Character.AI team released a blog post detailing how they built a highly efficient inference system that serves over 20,000 inference queries per second, roughly 20% of Google Search's traffic. They focused on shrinking the KV cache, which avoids recomputing attention keys and values during decoding but whose memory footprint is the key bottleneck in inference. Specifically, they implemented three model architecture changes that reduced the KV cache size by more than 20X.
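To make the bottleneck concrete, here is a back-of-envelope sketch of KV cache memory for a GPT-style decoder. The model dimensions, batch size, and sequence length are illustrative assumptions rather than Character.AI's actual configuration, and the multi-query variant simply stands in for the kind of architecture change that shrinks the cache.

```python
# Rough KV cache sizing for a decoder-only transformer (illustrative numbers).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values, one entry per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-ish model: 32 layers, 32 heads of dim 128, fp16 cache.
full_mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=8)
# Multi-query attention keeps a single shared KV head; grouped-query would use e.g. 4.
mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096, batch=8)

print(f"MHA cache: {full_mha / 1e9:.1f} GB")   # ~17.2 GB
print(f"MQA cache: {mqa / 1e9:.2f} GB")        # ~0.54 GB
```

Even at modest batch sizes, the full multi-head cache dwarfs the weights of a small model, which is why cutting the number of cached KV heads pays off so quickly.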
This past week, I came across the DiffusionDB dataset curated by the Polo Club of Data Science at Georgia Tech. They scraped over 14 million image-prompt pairs from users generating images in the Stable Diffusion Discord. Each entry includes the image and the text prompt used to create it, along with detailed metadata such as sampler settings, image properties, and usernames.
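The dataset is distributed through the Hugging Face Hub, so a minimal sketch of pulling a small subset and peeking at one record might look like the following. The `poloclub/diffusiondb` repo name and the `2m_random_1k` subset reflect my assumptions about how it is published, and the metadata keys shown are illustrative.

```python
from datasets import load_dataset

# Assumed Hub location and subset name; DiffusionDB ships a loading script,
# so newer versions of `datasets` may require trust_remote_code=True.
ds = load_dataset("poloclub/diffusiondb", "2m_random_1k",
                  split="train", trust_remote_code=True)

sample = ds[0]
print(sample["prompt"])                      # text prompt used to generate the image
print({k: v for k, v in sample.items()       # peek at the non-image metadata fields
       if k != "image"})
sample["image"].save("example.png")          # PIL image of the generation
```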
I was recently re-reading Finbarr Timbers' post on transformer inference optimizations, and I wanted to try implementing each of these techniques in nanoGPT to see how much we could practically speed up inference with GPT-2's architecture and reduce the model's computational bottlenecks.
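As a flavor of the kind of change involved, here is a minimal sketch of one such technique, KV caching, applied to a single (un-batched-head) attention step. The shapes and weight names are hypothetical and this is not nanoGPT's actual code; the point is only that each decode step appends one new key/value pair instead of re-running attention over the whole prefix.

```python
import torch

def cached_attn_step(x_new, w_q, w_k, w_v, k_cache, v_cache):
    # x_new: (batch, 1, d) embedding of the newest token only.
    q = x_new @ w_q                                        # query for this step
    k_cache = torch.cat([k_cache, x_new @ w_k], dim=1)     # append this step's key
    v_cache = torch.cat([v_cache, x_new @ w_v], dim=1)     # append this step's value
    att = (q @ k_cache.transpose(-2, -1)) / k_cache.size(-1) ** 0.5
    out = torch.softmax(att, dim=-1) @ v_cache             # attend over the cached prefix
    return out, k_cache, v_cache
```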
This post is an ongoing review of various sampling methods I've found through papers and online implementations. While samplers are relatively straightforward to create and combine, it is challenging to sample outputs that are both high quality and diverse. I'm particularly interested in finding samplers that are context-aware, resist hallucination, and can explore multiple reasoning paths in language models. For most of the samplers below I link their corresponding…
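As a baseline for what "straightforward to create and combine" means, here is a minimal sketch of temperature plus nucleus (top-p) sampling over a logits vector. It is a generic illustration, not one of the specific samplers reviewed in the post, and the default parameter values are arbitrary.

```python
import torch

def sample_top_p(logits, temperature=0.8, top_p=0.9):
    # Temperature rescales the distribution; lower values sharpen it.
    probs = torch.softmax(logits / temperature, dim=-1)

    # Keep the smallest set of tokens whose cumulative probability exceeds top_p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0   # drop the low-probability tail
    sorted_probs /= sorted_probs.sum()

    # Sample from the truncated, renormalized distribution.
    next_token = sorted_idx[torch.multinomial(sorted_probs, num_samples=1)]
    return next_token
```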
GPT-4o introduces a new tokenizer, which both doubles the model's vocabulary size to 200k (previously 100k with GPT-4) and significantly improves token efficiency across languages.
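A quick way to see the difference is to compare the two encodings with OpenAI's tiktoken library; the sample sentence below is arbitrary, and the exact token counts will vary with the text.

```python
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer (~100k vocab)
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o tokenizer (~200k vocab)

text = "नमस्ते दुनिया, यह एक टोकनाइज़र परीक्षण है।"  # arbitrary non-English sample
print(len(old_enc.encode(text)))  # typically more tokens
print(len(new_enc.encode(text)))  # typically fewer tokens with the larger vocabulary
```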