Recently, many papers have been published on optimizing LLM inference. This post introduces two of them, which focus on improving throughput by exploiting the characteristics of batched LLM serving and of attention.

## Orca

Orca, published in OSDI'22, proposes two novel techniques: 1. continuous batching (or iteration-level scheduling)[^1], and 2. selective batching.

### Continuous Batching

Before the introduction of continuous batching, static batching started a batch of requests all at once and waited until every request in the batch finished before scheduling the next batch, so early-finishing requests held their slots idle while the longest request kept generating.
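To make the difference concrete, here is a minimal sketch of iteration-level scheduling in plain Python. The names (`continuous_batching`, `decode_one_step`, `Request`) are hypothetical illustrations, not Orca's actual implementation: the point is only that the scheduler re-forms the batch at every decoding step, so a finished request frees its slot immediately and a waiting request can join without waiting for the whole batch to drain.

```python
import random
from collections import deque

class Request:
    """A toy request: we only track how many tokens remain to generate."""
    def __init__(self, rid, tokens_to_generate):
        self.rid = rid
        self.remaining = tokens_to_generate

def decode_one_step(batch):
    """Stand-in for one model forward pass that emits one token per request."""
    for req in batch:
        req.remaining -= 1

def continuous_batching(requests, max_batch_size=4):
    """Iteration-level scheduling: admit and evict requests at every step."""
    waiting = deque(requests)
    running = []
    step = 0
    while waiting or running:
        # Admit waiting requests as soon as slots free up,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        decode_one_step(running)
        step += 1
        # Finished requests leave the batch immediately,
        # freeing their slots for the next iteration.
        for req in running:
            if req.remaining == 0:
                print(f"step {step}: request {req.rid} finished")
        running = [r for r in running if r.remaining > 0]
    return step

if __name__ == "__main__":
    reqs = [Request(i, random.randint(2, 10)) for i in range(8)]
    total = continuous_batching(reqs)
    print(f"all requests served in {total} decoding steps")
```

Under static batching, the same sketch would run each group of requests for as many steps as its *longest* member needs; with iteration-level scheduling, a request that finishes after two tokens is replaced in the very next step.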