On one of the C programming subreddits today I saw that some people were under the impression that in order to build a mutex in C, atomic operations must be used. There’s a performance argume…| Chris Feilbach's Blog
Our paper on NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning has been accepted at the 31st International European Conference on Parallel and Distributed Computing (Euro-Par 2025). Abstract: Training large-scale distributed machine learning models imposes considerable demands on network infrastructure, often resulting in sudden traffic spikes that lead to congestion, increased latency, and reduced throughput, which would […]| Dirk Kutscher
How Amazon used the NVIDIA NeMo framework, GPUs and EFA from AWS to train some of its largest next-generation LLMs.| NVIDIA Blog