We study the problem of efficient generative inference for Transformer models in one of its most challenging settings: large, deep models with tight latency targets and long sequence lengths. A better understanding of the engineering tradeoffs of inference for large Transformer-based models is important, as use cases for these models are growing rapidly across application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning t...| arXiv.org