Authored by Biao He, Zilin Zhu, and Ji Li. 1. What is slime? slime is an LLM post-training framework aiming for RL scaling. It was designed to be: Versatile – with a fully customizable rollout interface and flexible training setups (colocated or decoupled, synchronous or asynchronous, RL or SFT cold start). Performant – natively integrating SGLang for inference and Megatron-LM for training. Maintainable – with a lightweight codebase and a smooth transition from Megatron pretraining to SGLang dep...| Biao's Blog
Authored by Binyao Jiang. This guide explains how to calculate the parameter count of a Mixture-of-Experts (MoE) large language model (LLM) from its architecture and configuration file. We'll use the Qwen3-30B-A3B model as an example to demonstrate the process. 1. Understand the Model Architecture To calculate a model's parameter count, you first need to understand its architecture. Initially, I considered technical reports as a primary source, but for models like Qwen3, which inherit the L...
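The counting exercise the post walks through can be sketched as a short script. The config values below are my reading of the public Qwen3-30B-A3B `config.json` and the formulas assume a standard untied-embedding, GQA-attention, SwiGLU-expert layout; verify both against the actual config and modeling code before relying on the numbers.

```python
# Rough parameter count for a Qwen3-30B-A3B-style MoE transformer.
# NOTE: config values are assumptions from the public config.json;
# small terms (norms, QK-norm, biases) are ignored.
cfg = dict(
    vocab_size=151936,
    hidden_size=2048,
    num_layers=48,
    num_heads=32,
    num_kv_heads=4,       # grouped-query attention
    head_dim=128,
    num_experts=128,
    experts_per_tok=8,
    moe_intermediate=768,
)

def count_params(c):
    h = c["hidden_size"]
    # Attention: Q, K, V, O projections (GQA shrinks K and V).
    q = h * c["num_heads"] * c["head_dim"]
    kv = 2 * h * c["num_kv_heads"] * c["head_dim"]
    o = c["num_heads"] * c["head_dim"] * h
    attn = q + kv + o
    # One expert is a SwiGLU MLP: gate, up, and down projections.
    expert = 3 * h * c["moe_intermediate"]
    router = h * c["num_experts"]
    per_layer = attn + router + c["num_experts"] * expert
    active_per_layer = attn + router + c["experts_per_tok"] * expert
    # Untied input embedding plus output head.
    embed = 2 * c["vocab_size"] * h
    total = embed + c["num_layers"] * per_layer
    active = embed + c["num_layers"] * active_per_layer
    return total, active

total, active = count_params(cfg)
print(f"total  ~{total / 1e9:.1f}B params")
print(f"active ~{active / 1e9:.1f}B params per token")
```

With these values the script lands near 30B total and 3B activated per token, which is consistent with the "30B-A3B" naming convention.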
A blog about my thoughts on ML Sys and LLMs | Biao's Blog
Introduction Flash Attention [1] is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. It is widely used in LLM inference and training, and it is the default attention backend in modern serving engines such as SGLang and vLLM. Naive Attention Calculation Before we look at how Flash Attention works, let's first examine the naive attention calculation. \[\begin{align} \text{a...
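As a concrete reference point for the excerpt above, the naive computation can be written in a few lines of NumPy (a sketch, not the post's own code): it materializes the full n×n score matrix, which is exactly the HBM traffic that Flash Attention's tiling avoids.

```python
import numpy as np

def naive_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, materializing the full score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n, n) matrix: O(n^2) memory
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.standard_normal((3, n, d))
out = naive_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Flash Attention computes the same result without ever forming the (n, n) matrix, streaming tiles of K and V through SRAM and rescaling the softmax online.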
Knowledge distillation is a model compression technique in which a small network (the student) is taught by a larger, trained network (the teacher). I. What is model distillation? Model distillation is a technique for transferring knowledge from a larger, more complex model (the "teacher") to a smaller, simpler model (the "student") in order to improve the smaller model's performance. The basic idea is that the teacher model has already learned useful information from th...
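The teacher-to-student transfer described above is typically trained with a temperature-softened KL term on the teacher's soft targets plus the ordinary hard-label cross-entropy. A minimal NumPy sketch (the temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values from the post):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, labels)."""
    p_t = softmax(teacher_logits, T)                       # soft teacher targets
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)
    kl = (p_t * (np.log(p_t + 1e-12) - log_p_s)).sum(axis=-1).mean()
    # Standard hard-label cross-entropy at temperature 1.
    probs = softmax(student_logits)
    hard = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * (T ** 2) * kl + (1 - alpha) * hard
```

The `T**2` factor keeps the soft-target gradient magnitude comparable across temperatures, following the usual Hinton-style formulation.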