This post explains the basics of LLM inference, focusing mainly on how it differs from LLM training.

## Autoregressive Text Generation

Unlike training, where all tokens in a sequence are processed in parallel, inference generates tokens one by one. Therefore, producing a full sentence requires several forward passes, one per generated token. The following video from HuggingFace illustrates how it works.

*Autoregressive token generation. Source: HuggingFace*

Before generating the first token, the LLM first puts all in...
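To make the per-token loop concrete, here is a minimal sketch of greedy autoregressive decoding using the Hugging Face `transformers` API. The model name (`"gpt2"`), the prompt, and the number of new tokens are placeholder assumptions, and greedy argmax is used instead of sampling just to keep the example short.

```python
# Minimal sketch: greedy autoregressive decoding, one forward pass per token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

max_new_tokens = 10
with torch.no_grad():
    for _ in range(max_new_tokens):
        # One full forward pass over the current sequence.
        logits = model(input_ids).logits
        # Greedily pick the most likely next token from the last position.
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # Append it and repeat: the sequence grows by one token per step.
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Note that this naive loop re-processes the entire sequence on every step; in practice a KV cache is used to avoid that redundant work.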