[!NOTE] This blog originated from our biweekly vLLM office hours, a community forum hosted by Red Hat with vLLM project committers and the UC Berkeley team. Each session covers recent updates, a deep dive with a guest speaker, and open Q&A. Join us every other Thursday at 2:00 PM ET / 11:00 AM PT on Google Meet, and get the recording and slides afterward on our YouTube playlist.
vLLM is a fast and easy-to-use library for LLM inference and serving.
Introduction
We’re thrilled to announce that vLLM now supports gpt-oss on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300X and MI355X GPUs. In this blog post, we’ll explore the efficient model architecture of gpt-oss and how vLLM supports it.
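As a quick taste, here is a minimal offline-inference sketch; the Hugging Face model ID `openai/gpt-oss-20b` and the sampling settings are assumptions, so substitute the checkpoint and parameters you actually intend to use.

```python
# Minimal sketch of running gpt-oss with vLLM (model ID and settings are assumptions).
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")  # assumed Hugging Face checkpoint name
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain mixture-of-experts routing in two sentences."], params)
print(outputs[0].outputs[0].text)
```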
This article explores how MiniMax-M1’s hybrid architecture is efficiently supported in vLLM. We discuss the model’s unique features, the challenges of efficient inference, and the technical solutions implemented in vLLM.
Since December 2024, through the joint efforts of the vLLM community and the Ascend team, we have completed the Hardware Pluggable RFC. This proposal allows hardware to be integrated into vLLM in a decoupled manner, enabling rapid and modular support for different hardware platforms.
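To make the idea concrete, here is a rough sketch of what an out-of-tree platform plugin can look like; the package and class names are hypothetical, and the `vllm.platform_plugins` entry-point group and `register()` contract are assumptions based on the plugin documentation rather than a definitive recipe.

```python
# Hypothetical out-of-tree hardware plugin (all names are illustrative).
# The package advertises this function through the "vllm.platform_plugins"
# entry-point group in its pyproject.toml, e.g.:
#   [project.entry-points."vllm.platform_plugins"]
#   my_accelerator = "vllm_my_accelerator:register"

def register() -> str:
    # Return the fully qualified name of a Platform subclass so that vLLM
    # can import it lazily when this plugin is selected.
    return "vllm_my_accelerator.platform.MyAcceleratorPlatform"
```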
As demand grows for training reasoning-capable large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) has emerged as a cornerstone technique. However, conventional RLHF pipelines—especially those using Proximal Policy Optimization (PPO)—are often hindered by substantial computational overhead. This challenge is particularly pronounced with models that excel at complex reasoning tasks (such as OpenAI-o1 and DeepSeek-R1), where generating long chain-of-thought (CoT)...
The Hugging Face Transformers library offers a flexible, unified interface to a vast ecosystem of model architectures. From research to fine-tuning on custom datasets, transformers is the go-to toolkit.
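As a small illustration of that unified interface, the same Auto classes load the tokenizer and model for virtually any architecture on the Hub; the GPT-2 checkpoint below is chosen arbitrarily.

```python
# Illustration of the unified Transformers Auto-class interface ("gpt2" is arbitrary).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("vLLM is a fast and easy-to-use library for", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```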
We’re excited to announce that vLLM now supports the Llama 4 herd of models: Scout (17B-16E) and Maverick (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8-10 images with good results), mixture-of-experts models in vLLM today by updating to version v0.8.3 or later.
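For example, after upgrading (e.g. `pip install -U vllm`), a minimal offline-inference sketch might look like this; the Scout model ID, parallelism, and context settings below are assumptions to adapt to your checkpoint and hardware.

```python
# Sketch: offline inference with Llama 4 Scout on vLLM >= v0.8.3.
# Model ID, tensor_parallel_size, and max_model_len are assumptions --
# adjust them to the checkpoint you use and the GPUs you have.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,   # shard the MoE weights across GPUs as needed
    max_model_len=32768,      # cap the context window to fit GPU memory
)
outputs = llm.generate(
    ["Summarize the advantages of mixture-of-experts models."],
    SamplingParams(temperature=0.6, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```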
TL;DR: vLLM on AMD ROCm now delivers improved FP8 performance!
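As a rough sketch of how one might try FP8 in vLLM (the model ID is illustrative, and these are the generic quantization knobs rather than ROCm-specific tuning):

```python
# Sketch: enabling FP8 weights/activations and an FP8 KV cache in vLLM.
# The model ID is illustrative; support for on-the-fly FP8 quantization
# depends on the checkpoint and the hardware you run on.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",     # dynamic FP8 quantization of weights/activations
    kv_cache_dtype="fp8",   # store the KV cache in FP8 as well
)
out = llm.generate(["Hello from ROCm!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```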
TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on the Llama 8B model, and 1.8x higher throughput and 2x faster TPOT on the Llama 70B model.