Eugene Belilovsky - Toward Globally Distributed Training of Foundation Models

Date: June 15, 2026, 1:00 p.m.
Room: 54-55 201

Frontier foundation model training keeps growing in scale and resource demands. It is largely dominated by homogeneous, centralized training clusters with co-located compute and expensive high-bandwidth interconnects, and is often limited by power consumption. Harnessing globally distributed and heterogeneous computational resources for these jobs is bottlenecked by the communication cost of moving data between accelerators. This bottleneck also often dictates how, where, and by whom these models can be trained. Building on a line of work in communication-efficient optimization, we will discuss recent work on practical low-bandwidth pre-training of foundation models, particularly LLMs. We will first look at data-parallel communication-efficient methods based on infrequent communication and gradient compression, discussing how these methods perform and scale to larger training scenarios. We will then consider settings where models far exceed the memory of individual accelerators, and how this can be addressed by low-bandwidth alternatives to traditional model parallelism that allow broader participation with lower-resource compute.