I’m doing my master’s thesis on distributed low-communication training. Essentially: how can we train large models efficiently across distributed nodes without being utterly destroyed by network latency and bandwidth? Below is some of what I’ve learned and investigated, day by day.

Day 3: Current Work on Heterogeneous Workers

A desirable problem to solve is being able to use different kinds of hardware for training. Even within the same generation, NVIDIA B300 GPUs are 50% fas...