PyTorch now offers native quantized variants of Phi4-mini-instruct, Qwen3, SmolLM3-3B and gemma-3-270m-it through a collaboration between the TorchAO team, ExecuTorch team, and Unsloth! These models leverage int4 and float8 quantization...| PyTorch
On October 21st, the AI Infra Summit comes to San Francisco alongside PyTorch Conference 2025, bringing together experts building the infrastructure behind the latest explosion in AI innovation. This half-day...| PyTorch
The Challenge of PyTorch 2.0 Compilation: Since the release of PyTorch 2.0 (PT2) and its powerful new compilation infrastructure, researchers and engineers have benefited from dramatic improvements in model execution...| PyTorch
PyTorch 2.8 has just been released with a set of exciting new features, including a limited stable libtorch ABI for third-party C++/CUDA extensions, high-performance quantized LLM inference on Intel CPUs with native PyTorch, experimental Wheel Variant Support, and inductor CUTLASS backend support. Among these features, a highlight is that PyTorch can now provide competitive Large Language Model (LLM) low-precision performance on Intel Xeon platforms as compared with other popular...| pytorch.org
PyTorch Foundation is the deep learning community home for the open source PyTorch framework and ecosystem.| PyTorch
Intel announces a major enhancement for distributed training in PyTorch 2.8: the native integration of the XCCL backend for Intel® GPUs (Intel® XPU devices). This provides support for Intel® oneAPI...| PyTorch
Key takeaways: PyTorch and vLLM have been organically integrated to accelerate cutting-edge generative AI applications, such as inference, post-training and agentic systems. Prefill/Decode Disaggregation is a crucial technique for enhancing...| PyTorch
As training jobs become larger, the likelihood of failures such as preemptions, crashes, or infrastructure instability rises. This can lead to significant inefficiencies in training and delays in time-to-market. At...| PyTorch
Be one of the first to take our brand new PyTorch Associate Training Course, which will be offered in person at the PyTorch Conference on Tuesday, October 21, 2025. Whether you...| PyTorch
A few months back, I traveled to Berlin to attend the WeAreDevelopers World Congress. During the event, I had the pleasure of hosting a hands-on workshop. As a first-time workshop...| PyTorch
In this blog post, we explore the kernel design details presented in the paper Fast and Simplex: 2-Simplicial Attention in Triton [1]. We begin by modeling the 2-Simplicial attention algorithm...| PyTorch
tl;dr: 1.22x – 1.28x training acceleration with MXFP8, equivalent convergence compared to BF16.| pytorch.org
We’re thrilled to announce that the Kubeflow Trainer project has been integrated into the PyTorch ecosystem! This integration ensures that Kubeflow Trainer aligns with PyTorch’s standards and practices, giving developers a reliable, scalable, and community-backed solution to run PyTorch on Kubernetes.| pytorch.org
Diffusers is the go-to library that provides a unified interface to cutting-edge and open diffusion models for image, video, and audio. Over the past few months, we have deepened its integration with torch.compile. By tailoring the compilation workflow to the diffusion model architecture, torch.compile delivers significant speed-ups with minimal impact on user experience. In this post, we will show how to unlock these gains. The target audience for this post is...| pytorch.org
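For orientation, a minimal sketch of what this workflow typically looks like; the model id and compile settings below are illustrative assumptions, not the post's exact recipe:

```python
import torch
from diffusers import DiffusionPipeline

# Example pipeline; any diffusers pipeline with a denoiser module works similarly.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Compile the denoiser, which dominates runtime; the first call triggers compilation.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

image = pipe(
    "a photo of an astronaut riding a horse", num_inference_steps=30
).images[0]
```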
Access and install previous PyTorch versions, including binaries and instructions for all platforms.| PyTorch
Attention, as a core layer of the ubiquitous Transformer architecture, is a bottleneck for large language models and long-context applications. FlashAttention (and FlashAttention-2) pioneered an approach to speed up attention on GPUs by minimizing memory reads/writes, and is now used by most libraries to accelerate Transformer training and inference. This has contributed to a massive increase in LLM context length in the last two years, from 2-4K (GPT-3, OPT) to 128K (GPT-4), or even 1M (Llama 3).| pytorch.org
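In PyTorch itself, these fused kernels are reachable through torch.nn.functional.scaled_dot_product_attention; the backend selection below is a sketch and assumes a recent PyTorch with the torch.nn.attention module and a CUDA GPU:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Batch of 4 sequences, 8 heads, length 1024, head dim 64 (shape: B, H, S, D).
q = torch.randn(4, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Request the FlashAttention kernel for this region when the inputs are eligible.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```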
We are excited to announce the release of PyTorch® 2.7 (release notes)! This release features:| pytorch.org
PyTorch 2.6 has just been released with a set of exciting new features including torch.compile compatibility with Python 3.13, new security and performance enhancements, and a change in the default parameter for torch.load. PyTorch also announced the deprecation of its official Anaconda channel.| pytorch.org
We are excited to announce the release of PyTorch® 2.6 (release notes)! This release features multiple improvements for PT2: torch.compile can now be used with Python 3.13; new performance-related knob torch.compiler.set_stance; several AOTInductor enhancements. Besides the PT2 improvements, another highlight is FP16 support on X86 CPUs.| pytorch.org
Transitioning from torch.distributed.launch to torchrun| pytorch.org
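A minimal sketch of the key change: torchrun passes the local rank through the LOCAL_RANK environment variable instead of a --local-rank command-line argument, so the training script reads it like this:

```python
import os
import torch
import torch.distributed as dist

# With torch.distributed.launch, scripts parsed a --local-rank argument;
# torchrun instead exports LOCAL_RANK (plus RANK and WORLD_SIZE) as env vars.
# Launch example: torchrun --nproc_per_node=8 train.py
local_rank = int(os.environ["LOCAL_RANK"])

dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)
# ... build the model on this device and wrap it in DistributedDataParallel ...
```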
Wrapper for C++ torch::jit::Module with methods, attributes, and parameters.| pytorch.org
torch.device| pytorch.org
Implements data parallelism at the module level.| pytorch.org
A wrapper for sharding module parameters across data parallel workers.| pytorch.org
Getting Started with Distributed Data Parallel| pytorch.org
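A minimal sketch of the DDP pattern, assuming a torchrun launch and a toy model chosen purely for illustration:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes launch via torchrun, which sets RANK, WORLD_SIZE and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).to(local_rank)         # toy model for illustration
    ddp_model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced automatically
    # (FullyShardedDataParallel can wrap a model in the same spot to shard its parameters.)

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    for _ in range(10):
        opt.zero_grad()
        out = ddp_model(torch.randn(20, 10, device=local_rank))
        out.sum().backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```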
In collaboration with the Metal engineering team at Apple, we are excited to announce support for GPU-accelerated PyTorch training on Mac. Until now, PyTorch training on Mac only leveraged the CPU, but with the upcoming PyTorch v1.12 release, developers and researchers can take advantage of Apple silicon GPUs for significantly faster model training. This unlocks the ability to perform machine learning workflows like prototyping and fine-tuning locally, right on Mac.| pytorch.org
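As a quick illustration (not taken from the announcement itself), selecting the Apple-silicon GPU backend looks like this once the MPS device is available:

```python
import torch

# Fall back to CPU when the Metal Performance Shaders backend is unavailable.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(128, 64).to(device)
x = torch.randn(32, 128, device=device)
y = model(x)   # runs on the Apple-silicon GPU when device is "mps"
```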
Efficient training of modern neural networks often relies on using lower precision data types. Peak float16 matrix multiplication and convolution performance is 16x faster than peak float32 performance on A100 GPUs. Since the float16 and bfloat16 data types are only half the size of float32, they can double the performance of bandwidth-bound kernels and reduce the memory required to train a network, allowing for larger models, larger batches, or larger inputs. Using a module like torch.amp makes it easy to get the speed and memory benefits of lower-precision data types while preserving convergence behavior.| pytorch.org
Most deep learning frameworks, including PyTorch, train with 32-bit floating point (FP32) arithmetic by default. However, this is not essential to achieve full accuracy for many deep learning models. In 2017, NVIDIA researchers developed a methodology for mixed-precision training, which combined single-precision (FP32) with half-precision (e.g., FP16) formats when training a network, and achieved the same accuracy as FP32 training using the same hyperparameters, with additional performance benefits.| pytorch.org
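A minimal sketch of the mixed-precision training loop this enables, using torch.amp's autocast and GradScaler; the toy model and optimizer are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.amp.GradScaler("cuda")   # scales the loss to avoid FP16 gradient underflow

for _ in range(10):
    x = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad()
    # Ops inside autocast run in float16 where it is numerically safe to do so.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then steps
    scaler.update()
```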
Join us at PyTorch Conference in San Francisco, October 22-23. CFP open now! Learn more.| PyTorch
torch.Tensor| pytorch.org
Per-parameter options| pytorch.org
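"Per-parameter options" refers to passing parameter groups to an optimizer, each with its own hyperparameters; a sketch, with the two sub-modules chosen for illustration:

```python
import torch
import torch.nn as nn

# Illustrative model with a "base" and a "classifier" sub-module.
model = nn.ModuleDict({
    "base": nn.Linear(128, 64),
    "classifier": nn.Linear(64, 10),
})

# Each dict is a parameter group; keys not given fall back to the defaults (lr=1e-2 here).
optimizer = torch.optim.SGD(
    [
        {"params": model["base"].parameters()},
        {"params": model["classifier"].parameters(), "lr": 1e-3},
    ],
    lr=1e-2,
    momentum=0.9,
)
```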
Applies Batch Normalization over a 4D input.| pytorch.org
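For example, with an N x C x H x W input (C = 100 features), mirroring the documentation's usage example:

```python
import torch
import torch.nn as nn

m = nn.BatchNorm2d(100)           # with learnable affine parameters (the default)
x = torch.randn(20, 100, 35, 45)  # N x C x H x W; statistics are computed per channel
out = m(x)                        # same shape as the input
```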
This post is the second part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch. We are excited to share a breadth of newly released PyTorch performance features alongside practical examples to see how far we can push PyTorch native performance. In part one, we showed how to accelerate Segment Anything over 8x using only pure, native PyTorch. In this blog we’ll focus on LLM optimization.| PyTorch
This criterion computes the cross entropy loss between input logits and target.| pytorch.org
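A short usage sketch with class indices as targets, mirroring the documentation's example shapes:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = torch.randn(3, 5, requires_grad=True)  # unnormalized scores: 3 samples, 5 classes
target = torch.randint(5, (3,))                 # class indices in [0, 5)
loss = loss_fn(logits, target)
loss.backward()
```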
Debugging| pytorch.org
Instances of autocast serve as context managers or decorators that allow regions of your script to run in mixed precision.| pytorch.org
torchvision| pytorch.org
Base class for all neural network modules.| pytorch.org
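The canonical pattern is to subclass nn.Module, register sub-modules in __init__, and define forward; a minimal sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Sub-modules assigned as attributes are registered automatically.
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

model = Model()
out = model(torch.randn(1, 1, 32, 32))  # calling the module invokes forward
```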
The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Lightning AI has joined as a premier member.| PyTorch
Eager Mode Quantization| pytorch.org
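As one example of the eager-mode workflows, dynamic quantization converts the weights of selected layer types to int8 with a single call; this is a sketch, and post-training static quantization additionally requires observers and calibration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Replace nn.Linear layers with dynamically quantized equivalents
# (int8 weights, activations quantized on the fly at inference time).
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = qmodel(torch.randn(1, 128))
```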
Loading Batched and Non-Batched Data| pytorch.org
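A sketch of the two modes: automatic batching (the default) versus batch_size=None, which yields samples one at a time; the dataset below is synthetic:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 8), torch.arange(100))

# Automatic batching: each iteration yields tensors with a leading batch dimension.
batched = DataLoader(dataset, batch_size=16, shuffle=True)

# Batching disabled: each iteration yields one sample, with no batch dimension added.
unbatched = DataLoader(dataset, batch_size=None)

x, y = next(iter(batched))      # x.shape == (16, 8)
x1, y1 = next(iter(unbatched))  # x1.shape == (8,)
```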
A guide to torch.cuda, a PyTorch module to run CUDA operations| pytorch.org
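A few of the module's basic entry points, for orientation (a sketch; requires a CUDA-capable GPU):

```python
import torch

if torch.cuda.is_available():
    print(torch.cuda.device_count(), "CUDA device(s)")
    print(torch.cuda.get_device_name(0))

    # Tensors are created on the CPU by default; create or move them onto a GPU explicitly.
    x = torch.randn(3, 3, device="cuda:0")
    with torch.cuda.device(0):
        y = torch.randn(3, 3, device="cuda")  # allocated on the currently selected device

    torch.cuda.synchronize()  # wait for queued kernels before timing or reading results
```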
torch.nn| pytorch.org
Allows the model to jointly attend to information from different representation subspaces.| pytorch.org
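In code, the "different representation subspaces" correspond to the num_heads projections; a usage sketch mirroring the documentation:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(4, 32, embed_dim)             # (batch, sequence, embedding) with batch_first=True
attn_output, attn_weights = mha(x, x, x)      # self-attention: query, key and value are the same
print(attn_output.shape, attn_weights.shape)  # (4, 32, 256), (4, 32, 32)
```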
If you installed PyTorch-nightly on Linux via pip between December 25, 2022 and December 30, 2022, please uninstall it and torchtriton immediately, and use the latest nightly binaries (newer than December 30, 2022).| PyTorch