Haohui Mai is affiliated with the company CausalFlow.ai.| AMD ROCm Blogs
This blog is part of a series of walkthroughs of Life Science AI models, stemming from this article, which was a collaborative effort between AstraZeneca and AMD. The series delves into what was required to run drug-discovery AI workloads on the AMD MI300X. The first post in the series, available here, focuses on REINVENT4, a molecular design tool used to generate and optimize candidate molecules. This blog, in particular, looks at SemlaFlow, an efficient 3D molecular generation model.
In this blog we explore how to use GSplat, a GPU-optimized Python library for training and rendering 3D Gaussian Splatting (3DGS) models, on AMD devices. This tutorial guides you through training a model of a scene from a set of captured images, which then allows you to render novel views of the scene. We use a port of the original GSplat code that has been optimized for AMD GPUs. The examples throughout this blog were trained and rendered on an AMD MI300X GPU.
Retrieval-Augmented Generation (RAG) is a machine learning architecture that enhances Large Language Models (LLMs) by combining generation with information retrieval from external sources. It was introduced to address the limitations of traditional LLMs by allowing them to access and utilize up-to-date information from internal and/or external knowledge bases. When a query is received, RAG first retrieves relevant documents or information from its knowledge bases, then uses this retrieved context to ground the model's response.
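The retrieve-then-generate flow described above can be sketched in a few lines of plain Python. Everything here is illustrative: the toy word-overlap scorer and the `generate` stub stand in for a real embedding-based retriever and an actual LLM call; they are not the API of any specific library.

```python
# Minimal sketch of the RAG flow: retrieve relevant documents, then
# generate an answer grounded in them. The scorer and generator are
# toy stand-ins, not a real retriever or LLM.

def retrieve(query, corpus, k=2):
    """Rank documents by naive word overlap with the query (toy scorer)."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(query, context):
    """Stand-in for an LLM call: a real system would prompt a model here,
    passing the retrieved context alongside the user's query."""
    return f"Answer to {query!r} grounded in: {' | '.join(context)}"

corpus = [
    "ROCm is AMD's open software platform for GPU computing.",
    "The MI300X GPU has 192 GB of HBM3 memory.",
    "Paris is the capital of France.",
]
docs = retrieve("How much memory does the MI300X have?", corpus)
print(generate("How much memory does the MI300X have?", docs))
```

In a production system the overlap scorer would be replaced by vector similarity over embeddings, and the context would be injected into the LLM prompt, but the two-step retrieve-then-generate shape stays the same.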
Modern AI workloads often don’t utilize the full capacity of advanced GPU hardware, especially when running smaller models or during development phases. The AMD GPU partitioning feature addresses this challenge by allowing you to divide physical GPUs into multiple virtual GPUs, dramatically improving resource utilization and cost efficiency.
FlashInfer is an innovative framework designed to accelerate inference of large language models (LLMs). Given the explosive growth and adoption of models like DeepSeek R1, Llama 3, and Qwen 3, efficient inference is critical to meet the demands of real-world deployment. However, challenges such as GPU memory bottlenecks, throughput limitations, and latency remain significant hurdles for deploying these models at scale.
In this blog post, we walk through how to use Matrix Cores in HIP kernels, with a focus on low-precision data types such as FP16, FP8, and FP4, as well as the new family of Matrix Core instructions with exponent block scaling introduced in the AMD CDNA™ 4 architecture. Through code examples and illustrations, we provide the knowledge needed to start programming Matrix Cores, covering modern low-precision floating-point types, the Matrix Core compiler intrinsics, and the data layouts required by these instructions.
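For intuition about what a single Matrix Core instruction computes: an MFMA operation performs D = A·B + C on a small fixed-size tile, with accumulation in a wider type than the inputs (16×16×16 is a common tile shape for FP16 inputs with FP32 accumulation). The plain-Python reference below shows only that tile-level math; it is not HIP code and does not use the actual compiler intrinsics.

```python
# CPU reference for the tile-level operation a single Matrix Core MFMA
# instruction performs: D = A @ B + C on one fixed-size M x N x K tile,
# accumulating in a wider type than the inputs (Python floats here stand
# in for FP16 inputs with FP32 accumulation).
M = N = K = 16  # a common MFMA tile shape for FP16 inputs

def mfma_tile_reference(A, B, C):
    """D[i][j] = C[i][j] + sum_k A[i][k] * B[k][j] for one tile."""
    D = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = C[i][j]  # accumulator starts from C, as in MFMA
            for k in range(K):
                acc += A[i][k] * B[k][j]
            D[i][j] = acc
    return D

A = [[1.0] * K for _ in range(M)]  # all-ones M x K input tile
B = [[1.0] * N for _ in range(K)]  # all-ones K x N input tile
C = [[0.5] * N for _ in range(M)]  # accumulator tile
D = mfma_tile_reference(A, B, C)   # every entry: 16 * 1.0 + 0.5 = 16.5
```

On the hardware, one such tile operation is issued by a whole wavefront, with the A, B, and C fragments distributed across lane registers in the layouts the blog describes; this reference only captures the arithmetic.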
The rapid rise of AI-assisted development is transforming how software is built, with coding agents emerging as powerful tools for modern developers. In this blog, we will show you how to deploy coding agents on AMD GPUs using frameworks such as SGLang, vLLM, and llama.cpp, and walk through a practical workflow example: creating a Minesweeper game using Aider.
AMD is excited to provide Day-0 support for the SGLang-native RL framework, slime. In this post, we provide more details about our support and optimizations, as well as slime’s benefits for large-scale RL training. First, we describe the engineering efforts behind slime, including codebase modifications, kernel-level memory management for ROCm™ software, and changes to third-party dependencies (Megatron-LM, SGLang, and torch_memory_saver), as well as the Docker images that enable …
This blog highlights AMD ROCm’s ability to power next-generation audio-to-video models with simple, reproducible workflows.
In this blog, we provide step-by-step instructions on how to reproduce AMD’s MLPerf Inference v5.1 submission.
Enabling ROCm support for FastVideo inference using TeaCache on AMD Instinct GPUs, accelerating video generation with optimized backends.
We present AMD Hummingbird, a two-stage distillation framework for efficient, high-quality text-to-video generation using compact models.
Fine-tune Llama 3.2 Vision models on an AMD MI300X GPU using Torchtune, achieving 2.3× better accuracy with the 11B model versus the 90B model on chart-based tasks.
This blog explains the reasons behind RCCL bandwidth limitations and xGMI performance constraints, and provides actionable steps to maximize link efficiency on the AMD MI300X.
A practical guide to accelerating video generation with HunyuanVideo and Wan 2.1 using Unified Sequence Parallelism on AMD GPUs.
vLLM v1 on AMD ROCm boosts LLM serving with faster TTFT, higher throughput, and optimized multimodal support, ready out of the box.
A step-by-step guide to reproducing AMD’s MLPerf v5.0 results for Llama 2 70B and SDXL using ROCm on the MI325X.
AMD is excited to announce Instella, a family of fully open, state-of-the-art 3-billion-parameter language models (LMs). In this blog we explain how the Instella models were trained and how to access them.
AMD's GPU training optimizations deliver peak performance for advanced AI models through the ROCm software stack.
Learn how to optimize large language model inference using vLLM on AMD's MI300X GPUs for enhanced performance and efficiency.
This post, the second in a series, provides a walkthrough for building a vLLM container that can be used for both inference and benchmarking.
In this post we present a brief preview of AMD's Next-Gen Fortran Compiler, our new open-source Fortran compiler optimized for AMD GPUs using OpenMP offloading, offering a direct interface to ROCm and HIP.