The Startup Showcase returns to the PyTorch Conference on Tuesday, October 21, 2025, spotlighting the most promising early-stage teams building real-world AI applications. The program gives founders a high-visibility platform...| PyTorch
Large Language Models (LLMs) have revolutionized how we write and consume documents. In the past year or so, we have started to see them do a lot more than just rephrasing...| PyTorch
TL;DR NJTs (Nested Jagged Tensors) boost DRAMA model inference efficiency by 1.7x-2.3x, making it more production-ready in the category of LLM-based encoders, especially with variable-length sequences. Introduction and Context Recent...| PyTorch
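For readers unfamiliar with NJTs, here is a minimal, hedged sketch of building a jagged-layout nested tensor in PyTorch to batch variable-length sequences without padding; the shapes are made up and this is not the DRAMA integration itself:

```python
import torch

# Hypothetical variable-length batch: three sequences with different lengths
seqs = [torch.randn(5, 64), torch.randn(12, 64), torch.randn(3, 64)]

# Nested Jagged Tensor: batches the sequences without padding them to a common length
njt = torch.nested.nested_tensor(seqs, layout=torch.jagged)
print(njt.shape)  # e.g. torch.Size([3, j1, 64]) -- the middle dimension is jagged
```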
Introduction ZenFlow is a new extension to DeepSpeed introduced in summer 2025, designed as a stall-free offloading engine for large language model (LLM) training. Offloading is a widely used technique...| PyTorch
In this post, we present an optimized Triton BF16 Grouped GEMM kernel for running training and inference on Mixture-of-Experts (MoE) models, such as DeepSeekv3. A Grouped GEMM applies independent GEMMs...| PyTorch
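As a point of reference, the semantics of a grouped GEMM can be sketched with a plain PyTorch loop; the Triton kernel described in the post fuses these independent matmuls into a single launch (the shapes below are illustrative):

```python
import torch

def grouped_gemm_reference(xs, ws):
    # One independent GEMM per group; each group may have a different M
    # (e.g. a different number of tokens routed to each expert in an MoE layer)
    return [x @ w for x, w in zip(xs, ws)]

# Illustrative MoE-style shapes: 4 experts, variable token counts, shared K/N
xs = [torch.randn(m, 256, dtype=torch.bfloat16) for m in (7, 128, 33, 64)]
ws = [torch.randn(256, 512, dtype=torch.bfloat16) for _ in range(4)]
outs = grouped_gemm_reference(xs, ws)
print([o.shape for o in outs])
```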
Mark your calendars! The inaugural Open Source AI Week is coming to the San Francisco Bay Area from October 18–26, 2025. This week-long celebration is the premier destination for the...| PyTorch
A tweet from charliemarsh, creator of uv. PyTorch is the leading machine learning framework for developing and deploying some of the largest AI products from around the world. However, there is one...| PyTorch
On June 7, 2025, PyTorch Day China was held in Beijing, co-hosted by PyTorch Foundation and the Beijing Academy of Artificial Intelligence (BAAI). The one-day conference featured 16 talks and...| PyTorch
Introduction We integrate mixed and low-precision training with Opacus to unlock increased throughput and training with larger batch sizes. Our initial experiments show that one can maintain the same utility...| PyTorch
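As a rough, unverified sketch of what combining Opacus with autocast-based mixed precision might look like (the exact recipe and its caveats are in the post; the toy model, shapes, and the use of bfloat16 autocast here are assumptions):

```python
import torch
from opacus import PrivacyEngine

model = torch.nn.Linear(64, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(256, 64), torch.randint(0, 2, (256,))),
    batch_size=32,
)

# Standard Opacus setup: wraps the model, optimizer and loader for DP-SGD
privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model, optimizer=optimizer, data_loader=data_loader,
    noise_multiplier=1.0, max_grad_norm=1.0,
)

criterion = torch.nn.CrossEntropyLoss()
for x, y in data_loader:
    optimizer.zero_grad()
    # Assumption: the forward/backward pass runs under autocast for lower precision
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```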
We’re thrilled to announce that the Kubeflow Trainer project has been integrated into the PyTorch ecosystem! This integration ensures that Kubeflow Trainer aligns with PyTorch’s standards and practices, giving developers a reliable, scalable, and community-backed solution to run PyTorch on Kubernetes.| pytorch.org
Diffusers is the go-to library that provides a unified interface to cutting-edge and open diffusion models for image, video, and audio. Over the past few months, we have deepened its integration with torch.compile. By tailoring the compilation workflow to the diffusion model architecture, torch.compile delivers significant speed-ups with minimal impact on user experience. In this post, we will show how to unlock these gains. The target audience for this post is| pytorch.org
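For orientation, a minimal sketch of the kind of integration the post discusses: compiling the denoiser of a diffusers pipeline with torch.compile (the model ID and compile mode here are illustrative, not the post's exact settings):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Compile the denoiser, typically the most expensive component of the pipeline
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

image = pipe("an astronaut riding a horse").images[0]
```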
In our earlier post, diffusion-fast, we showed how the Stable Diffusion XL (SDXL) pipeline can be optimized up to 3x using native PyTorch code. Back then, SDXL was an open SoTA pipeline for image generation. Quite unsurprisingly, a lot has changed since then, and it’s safe to say that Flux is now one of the most capable open-weight models in the space.| pytorch.org
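As a hedged sketch of what running Flux with native PyTorch optimizations can look like (the model ID, dtype, and compile mode are assumptions; the post covers the actual optimization recipe):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Flux uses a transformer as its denoiser; compile it for extra speed
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

image = pipe("a photo of a red fox in the snow", num_inference_steps=28).images[0]
```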
Collaborators: Less Wright, Howard Huang, Chien-Chin Huang; Crusoe: Martin Cala, Ethan Petersen| pytorch.org
Access and install previous PyTorch versions, including binaries and instructions for all platforms.| PyTorch
Today, we’re announcing the availability of PyTorch 1.6, along with updated domain libraries. We are also excited to announce the team at Microsoft is now maintaining Windows builds and binaries and will also be supporting the community on GitHub as well as the PyTorch Windows discussion forums.| pytorch.org
PyTorch Foundation is the deep learning community home for the open source PyTorch framework and ecosystem.| PyTorch
Attention, as a core layer of the ubiquitous Transformer architecture, is a bottleneck for large language models and long-context applications. FlashAttention (and FlashAttention-2) pioneered an approach to speed up attention on GPUs by minimizing memory reads/writes, and is now used by most libraries to accelerate Transformer training and inference. This has contributed to a massive increase in LLM context length in the last two years, from 2-4K (GPT-3, OPT) to 128K (GPT-4), or even 1M (Ll...| pytorch.org
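In PyTorch itself, FlashAttention is reachable through scaled_dot_product_attention; a minimal sketch (requires a recent PyTorch and a supported GPU, and shows PyTorch's SDPA dispatch rather than the FlashAttention library directly):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# (batch, heads, seq_len, head_dim) in half precision on GPU
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the FlashAttention backend for this region
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```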
We are excited to announce the release of PyTorch® 2.7 (release notes)! This release features:| pytorch.org
PyTorch* 2.6 has just been released with a set of exciting new features including torch.compile compatibility with Python 3.13, new security and performance enhancements, and a change in the default parameter for torch.load. PyTorch also announced the deprecation of its official Anaconda channel.| pytorch.org
We are excited to announce the release of PyTorch® 2.6 (release notes)! This release features multiple improvements for PT2: torch.compile can now be used with Python 3.13; new performance-related knob torch.compiler.set_stance; several AOTInductor enhancements. Besides the PT2 improvements, another highlight is FP16 support on X86 CPUs.| pytorch.org
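A small sketch of the new torch.compiler.set_stance knob mentioned above (stance names follow the 2.6 docs; the toy function is illustrative):

```python
import torch

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

print(f(torch.randn(4)))                    # compiled on first call

torch.compiler.set_stance("force_eager")    # subsequent calls skip compilation
print(f(torch.randn(4)))
torch.compiler.set_stance("default")        # restore normal compile behavior
```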
Transitioning from torch.distributed.launch to torchrun| pytorch.org
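The core code change when moving from torch.distributed.launch to torchrun is reading the local rank from the environment instead of a --local_rank argument; a minimal sketch:

```python
import os

# torch.distributed.launch passed --local_rank as a CLI argument;
# torchrun exports it (along with RANK and WORLD_SIZE) as environment variables.
local_rank = int(os.environ["LOCAL_RANK"])
```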
Wrapper for C++ torch::jit::Module with methods, attributes, and parameters.| pytorch.org
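That description refers to ScriptModule, which you typically obtain via torch.jit.script; a minimal sketch with a toy module:

```python
import torch

class MyCell(torch.nn.Module):
    def forward(self, x):
        return torch.tanh(x)

# torch.jit.script compiles the module and returns a ScriptModule
# wrapping the underlying C++ torch::jit::Module
scripted = torch.jit.script(MyCell())
print(scripted(torch.randn(3)))
```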
torch.device| pytorch.org
Implements data parallelism at the module level.| pytorch.org
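A minimal sketch of nn.DataParallel with a toy model (DistributedDataParallel is generally recommended instead for multi-GPU training):

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 10)
if torch.cuda.device_count() > 1:
    # Replicates the module on each GPU and splits the input batch across them
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()
    out = model(torch.randn(64, 256, device="cuda"))
```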
A wrapper for sharding module parameters across data parallel workers.| pytorch.org
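A minimal, hedged sketch of wrapping a model in FullyShardedDataParallel (assumes the script is launched with torchrun so a process group can be initialized; the model is a stand-in):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")               # set up the data-parallel group
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Transformer().cuda()
sharded_model = FSDP(model)                           # parameters are sharded across workers
```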
Getting Started with Distributed Data Parallel| pytorch.org
Implement distributed data parallelism based on torch.distributed at module level.| pytorch.org
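A compact, hedged DDP sketch in the spirit of the getting-started tutorial (toy model and shapes; launch with torchrun so the rank environment variables are set):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # torchrun provides RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(32, 32).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across workers

    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    x = torch.randn(16, 32, device=f"cuda:{local_rank}")
    opt.zero_grad()
    ddp_model(x).sum().backward()
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```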
In collaboration with the Metal engineering team at Apple, we are excited to announce support for GPU-accelerated PyTorch training on Mac. Until now, PyTorch training on Mac only leveraged the CPU, but with the upcoming PyTorch v1.12 release, developers and researchers can take advantage of Apple silicon GPUs for significantly faster model training. This unlocks the ability to perform machine learning workflows like prototyping and fine-tuning locally, right on Mac.| pytorch.org
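Using the Apple silicon GPU from PyTorch boils down to targeting the "mps" device; a minimal sketch:

```python
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")                 # Apple silicon GPU via Metal
    model = torch.nn.Linear(128, 10).to(device)
    x = torch.randn(32, 128, device=device)
    print(model(x).device)                       # mps:0
```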
Efficient training of modern neural networks often relies on using lower precision data types. Peak float16 matrix multiplication and convolution performance is 16x faster than peak float32 performance on A100 GPUs. And since the float16 and bfloat16 data types are only half the size of float32 they can double the performance of bandwidth-bound kernels and reduce the memory required to train a network, allowing for larger models, larger batches, or larger inputs. Using a module like torch.am...| pytorch.org
Most deep learning frameworks, including PyTorch, train with 32-bit floating point (FP32) arithmetic by default. However this is not essential to achieve full accuracy for many deep learning models. In 2017, NVIDIA researchers developed a methodology for mixed-precision training, which combined single-precision (FP32) with half-precision (e.g. FP16) format when training a network, and achieved the same accuracy as FP32 training using the same hyperparameters, with additional performance be...| pytorch.org
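A minimal sketch of the mixed-precision training loop these posts describe, using autocast together with a gradient scaler (toy model and loss; assumes a CUDA GPU):

```python
import torch

model = torch.nn.Linear(64, 64).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")   # scales the loss to avoid FP16 gradient underflow

for _ in range(10):
    x = torch.randn(32, 64, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()   # forward pass runs eligible ops in FP16
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```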
Join us at PyTorch Conference in San Francisco, October 22-23. CFP open now! Learn more.| PyTorch
torch.Tensor| pytorch.org
Per-parameter options| pytorch.org
Applies Batch Normalization over a 4D input.| pytorch.org
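That is nn.BatchNorm2d; a minimal sketch with a 4D (N, C, H, W) input:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)   # C = 16 channels
x = torch.randn(8, 16, 32, 32)         # 4D input: (N, C, H, W)
y = bn(x)                              # normalized per channel over (N, H, W)
```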
This post is the second part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch. We are excited to share a breadth of newly released PyTorch performance features alongside practical examples to see how far we can push PyTorch native performance. In part one, we showed how to accelerate Segment Anything over 8x using only pure, native PyTorch. In this blog we’ll focus on LLM optimization.| PyTorch
This criterion computes the cross entropy loss between input logits and target.| pytorch.org
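A minimal sketch of nn.CrossEntropyLoss with class-index targets (shapes are illustrative):

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = torch.randn(8, 5, requires_grad=True)   # (batch, num_classes), unnormalized scores
target = torch.randint(0, 5, (8,))               # class indices in [0, num_classes)
loss = loss_fn(logits, target)
loss.backward()
```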
Debugging| pytorch.org
Instances of autocast serve as context managers or decorators that allow regions of your script to run in mixed precision.| pytorch.org
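A minimal sketch of both usages (CPU autocast with bfloat16 so it runs anywhere; on a GPU you would pass device_type="cuda"):

```python
import torch

model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)

# As a context manager
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)  # torch.bfloat16 for autocast-eligible ops

# As a decorator
@torch.autocast(device_type="cpu", dtype=torch.bfloat16)
def forward(m, inp):
    return m(inp)
```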
torchvision| pytorch.org
Base class for all neural network modules.| pytorch.org
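The usual pattern is to subclass nn.Module, register submodules in __init__, and define forward; a minimal sketch:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = TinyNet()
print(net(torch.randn(2, 16)).shape)   # torch.Size([2, 4])
```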
The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Lightning AI has joined as a premier member.| PyTorch
Eager Mode Quantization| pytorch.org
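As one concrete example of eager mode quantization, a dynamic-quantization sketch (the model is a toy; static quantization additionally needs QuantStub/DeQuantStub and calibration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Eager mode dynamic quantization: Linear weights become int8,
# activations are quantized on the fly at inference time
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
out = qmodel(torch.randn(1, 128))
```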
Loading Batched and Non-Batched Data| pytorch.org
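The distinction is whether the DataLoader performs automatic batching; a minimal sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(10).float())

batched = DataLoader(ds, batch_size=4)       # automatic batching: yields batches of 4
unbatched = DataLoader(ds, batch_size=None)  # batching disabled: yields single samples

print(next(iter(batched))[0].shape)          # torch.Size([4])
print(next(iter(unbatched))[0].shape)        # torch.Size([])
```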
A guide to torch.cuda, a PyTorch module to run CUDA operations| pytorch.org
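A few of the basic torch.cuda calls the guide covers, as a runnable sketch:

```python
import torch

if torch.cuda.is_available():
    print(torch.cuda.device_count(), torch.cuda.get_device_name(0))
    x = torch.randn(4, 4, device="cuda")     # allocate directly on the GPU
    y = (x @ x).cpu()                        # kernels are queued asynchronously
    torch.cuda.synchronize()                 # wait for all queued work to finish
```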
torch.nn| pytorch.org
Allows the model to jointly attend to information from different representation subspaces.| pytorch.org
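That is nn.MultiheadAttention; a minimal self-attention sketch:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)                 # (batch, seq, embed_dim)
attn_out, attn_weights = mha(x, x, x)      # self-attention: query = key = value
print(attn_out.shape, attn_weights.shape)  # (2, 10, 64), (2, 10, 10)
```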
If you installed PyTorch-nightly on Linux via pip between December 25, 2022 and December 30, 2022, please uninstall it and torchtriton immediately, and use the latest nightly binaries (newer than Dec 30th 2022).| PyTorch