INTELLECT-1 Technical Report

Sami Jaghouar, Jack Min Ong, Manveer Basra, Fares Obeid, Jannik Straube, Michael Keiblinger, Elie Bakouch, Lucas Atkins, Maziyar Panahi, Charles Goddard, Max Ryabinin, Johannes Hagemann
2024-12-02
Abstract: In this report, we introduce INTELLECT-1, the first 10 billion parameter language model collaboratively trained across the globe, demonstrating that large-scale model training is no longer confined to large corporations but can be achieved through a distributed, community-driven approach. INTELLECT-1 was trained on 1 trillion tokens using up to 14 concurrent nodes distributed across 3 continents, with contributions from 30 independent compute providers dynamically joining and leaving the training process, while maintaining 83-96% compute utilization and 36.2-41.4% model FLOPS utilization. We leverage PRIME, our scalable distributed training framework designed for fault-tolerant, high-performance training on unreliable, globally distributed nodes. Key innovations in PRIME include the ElasticDeviceMesh, which manages dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node, live checkpoint recovery kernels, and a hybrid DiLoCo-FSDP2 implementation. Using PRIME with DiLoCo and our custom int8 all-reduce, we achieve a 400x reduction in communication bandwidth compared to traditional data-parallel training settings while delivering comparable performance. These results demonstrate the feasibility and promise of training frontier foundation models in a decentralized network of global GPU resources.
Distributed, Parallel, and Cluster Computing
### What problems does this paper attempt to solve?

This paper addresses the infrastructure challenges of large-scale language model training in distributed environments where bandwidth is limited and nodes join and leave dynamically. It focuses on three issues:

1. **Bandwidth Limitation**: Traditional distributed training relies on high-bandwidth interconnects within data centers (such as InfiniBand). In geographically distributed environments, the bandwidth between nodes may be as low as several hundred Mb/s to several Gb/s, far below that of typical high-performance computing environments. This makes communication the main bottleneck in distributed training.
2. **Dynamic Node Membership**: Nodes may join or leave during training, which makes system reliability a key concern. Node behavior is even less predictable in a community-driven training setting.
3. **Resource Utilization Efficiency**: Compute utilization must stay high, and training must remain stable and convergent, despite limited bandwidth and changing node membership.

To solve these problems, the paper introduces INTELLECT-1, a 10-billion-parameter language model trained collaboratively across globally distributed nodes. The authors developed a framework named PRIME, designed for efficient, fault-tolerant distributed training on unreliable, bandwidth-limited nodes. The key innovations of PRIME include:

- **ElasticDeviceMesh**: A new abstraction layer that manages fault-tolerant communication across the Internet and efficient communication within nodes.
- **DiLoCo Algorithm**: A distributed low-communication algorithm that reduces communication volume by synchronizing less frequently.
- **Int8 Quantization**: Further reduces the communication bandwidth requirement by quantizing the pseudo-gradients to 8-bit integers.
- **Hybrid FSDP and DiLoCo**: Combines fully sharded data parallel (FSDP) training with DiLoCo to optimize both local and cross-node training efficiency.
- **Fault-Tolerant Mechanisms**: Include dynamic node management and live checkpoint recovery to handle nodes joining and leaving.

With these techniques, the paper shows that large-scale language models can still be trained efficiently when bandwidth is limited and node membership changes, opening new possibilities for decentralized, community-driven AI model training.

### Summary

The core problem of the paper is how to train large-scale language models efficiently in distributed environments with limited bandwidth and changing node membership. Using the PRIME framework, the authors successfully trained INTELLECT-1 across a network of globally distributed nodes, demonstrating the feasibility and potential of decentralized training.
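The interplay between infrequent synchronization and int8 quantization of pseudo-gradients can be illustrated with a toy, single-process simulation. This is a minimal sketch under stated assumptions, not PRIME's implementation: all function names are hypothetical, both the inner and outer optimizers are plain SGD, the quantizer uses simple max-abs symmetric scaling, and the workers are simulated in a loop rather than communicating through an all-reduce. The `comm_reduction` helper additionally assumes an fp32 baseline that synchronizes every step, and the 100-inner-step figure used in the usage note is an assumption that does not appear in the text above.

```python
def quantize_int8(values):
    """Symmetric max-abs quantization to the int8 range (assumed scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to floats."""
    return [c * scale for c in codes]


def inner_steps(params, grad_fn, lr, num_steps):
    """Run local SGD steps on one worker with no communication."""
    local = list(params)
    for _ in range(num_steps):
        grads = grad_fn(local)
        local = [p - lr * g for p, g in zip(local, grads)]
    return local


def diloco_round(global_params, grad_fns, inner_lr, num_inner_steps, outer_lr):
    """One outer step: local training, quantized pseudo-gradients, average."""
    received = []
    for grad_fn in grad_fns:  # each grad_fn stands in for one worker's data
        local = inner_steps(global_params, grad_fn, inner_lr, num_inner_steps)
        # Pseudo-gradient: how far the worker moved away from the global weights.
        pseudo_grad = [g - l for g, l in zip(global_params, local)]
        codes, scale = quantize_int8(pseudo_grad)  # what would go over the wire
        received.append(dequantize_int8(codes, scale))
    n = len(grad_fns)
    avg = [sum(column) / n for column in zip(*received)]
    # Outer optimizer (plain SGD here, as a simplifying assumption).
    return [p - outer_lr * a for p, a in zip(global_params, avg)]


def comm_reduction(num_inner_steps, baseline_bits=32, quant_bits=8):
    """Bandwidth reduction versus syncing full-precision gradients every step."""
    return num_inner_steps * (baseline_bits / quant_bits)


def make_quadratic_grad(target):
    """Gradient of 0.5 * ||p - target||^2, a toy per-worker objective."""
    return lambda params: [p - t for p, t in zip(params, target)]
```

As a usage example, two simulated workers with targets `[1.0, -2.0]` and `[3.0, 0.0]` can be passed as `grad_fns = [make_quadratic_grad(t) for t in targets]`; repeated calls to `diloco_round` then move the global parameters toward the average of the two targets. Under the assumptions above, `comm_reduction(100)` multiplies a 100x saving from synchronizing once per 100 steps by a 4x saving from sending 8 bits instead of 32, which would be consistent with the 400x figure quoted in the abstract.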