DiPaCo: Distributed Path Composition

Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam
2024-03-16
Abstract: Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high-bandwidth communication between devices working in parallel. In this work, we propose a co-designed modular architecture and training approach for ML models, dubbed DIstributed PAth COmposition (DiPaCo). During training, DiPaCo distributes computation by paths through a set of shared modules. Together with a Local-SGD-inspired optimization (DiLoCo) that keeps modules in sync with drastically reduced communication, our approach facilitates training across poorly connected and heterogeneous workers, with a design that ensures robustness to worker failures and preemptions. At inference time, only a single path needs to be executed for each input, without the need for any model compression. We consider this approach as a first prototype towards a new paradigm of large-scale learning, one that is less synchronous and more modular. Our experiments on the widely used C4 benchmark show that, for the same number of training steps but less wall-clock time, DiPaCo exceeds the performance of a 1 billion-parameter dense transformer language model by choosing one of 256 possible paths, each with a size of 150 million parameters.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper addresses the inefficiency of training large-scale machine learning models, which normally depends on high-bandwidth communication between tightly interconnected devices. It introduces the DIstributed PAth COmposition (DiPaCo) architecture and training method: a modular, distributed design that spreads computation across devices, reduces communication costs, and makes training more robust. DiPaCo relies on path composition: the model is a set of shared modules, each path is one composition of those modules, and only the modules on a given path are activated during training and inference. This reduces reliance on highly interconnected devices, accommodates heterogeneous computing resources, and shortens training time without sacrificing performance. In language-modeling experiments, the paper reports that DiPaCo outperforms a 1.3-billion-parameter model for the same number of training steps while reducing training time by 45%.
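
The idea of paths composed from shared modules can be illustrated with a small sketch. This is not the authors' implementation. It assumes a hypothetical layout of 2 levels with 16 modules each, chosen only because it yields the 256 paths mentioned in the abstract, and all function names are made up for illustration. Plain averaging of per-path updates stands in for the DiLoCo synchronization step, whose details are not given here.

```python
from itertools import product

# Assumed layout (not stated in the text above): 2 levels x 16 modules = 256 paths.
NUM_LEVELS = 2
MODULES_PER_LEVEL = 16


def enumerate_paths(num_levels, modules_per_level):
    """Return every path as a tuple holding one module index per level."""
    return list(product(range(modules_per_level), repeat=num_levels))


def paths_sharing_module(paths, level, module):
    """Return the paths that pass through a given module at a given level."""
    return [p for p in paths if p[level] == module]


def average_shared_module(updates):
    """Average the updates that several paths produced for one shared module.

    Each update is a list of floats. Plain averaging is a simplification
    used here in place of the actual DiLoCo optimization step.
    """
    n = len(updates)
    return [sum(values) / n for values in zip(*updates)]


paths = enumerate_paths(NUM_LEVELS, MODULES_PER_LEVEL)
print(len(paths))                                           # 256
print(len(paths_sharing_module(paths, level=0, module=3)))  # 16
print(average_shared_module([[1.0, 2.0], [3.0, 4.0]]))      # [2.0, 3.0]
```

Under these assumptions, each module is shared by 16 of the 256 paths, so only those 16 workers would need to exchange updates for that module, and a single input at inference time touches just the 2 modules on its path.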