Abstract: In this report, we introduce INTELLECT-1, the first 10 billion parameter language model collaboratively trained across the globe, demonstrating that large-scale model training is no longer confined to large corporations but can be achieved through a distributed, community-driven approach. INTELLECT-1 was trained on 1 trillion tokens using up to 14 concurrent nodes distributed across 3 continents, with contributions from 30 independent compute providers dynamically joining and leaving the training process, while maintaining 83-96% compute utilization and 36.2-41.4% model FLOPS utilization. We leverage PRIME, our scalable distributed training framework designed for fault-tolerant, high-performance training on unreliable, globally distributed nodes. Key innovations in PRIME include the ElasticDeviceMesh, which manages dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node, live checkpoint recovery kernels, and a hybrid DiLoCo-FSDP2 implementation. Using PRIME with DiLoCo and our custom int8 all-reduce, we achieve a 400x reduction in communication bandwidth compared to traditional data-parallel training settings while delivering comparable performance. These results demonstrate the feasibility and promise of training frontier foundation models in a decentralized network of global GPU resources.
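One way to see where a reduction of this size could come from is a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes a baseline that all-reduces fp32 gradients on every step, and a low-communication setup that exchanges int8 values once every 100 local steps. The 100-step interval and the fp32 baseline are assumptions made for the example, not figures stated in the abstract, and the helper `bytes_per_step` is hypothetical.

```python
def bytes_per_step(num_params: int, bytes_per_value: int, steps_between_syncs: int) -> float:
    """Average payload a worker contributes per training step, in bytes.

    Ignores all-reduce topology factors, which cancel in the ratio below.
    """
    return num_params * bytes_per_value / steps_between_syncs


NUM_PARAMS = 10_000_000_000  # 10 billion parameters (from the abstract)
INNER_STEPS = 100            # assumption: local steps between synchronizations

# Assumed baseline: fp32 gradients (4 bytes each), synchronized every step.
baseline = bytes_per_step(NUM_PARAMS, 4, 1)

# Assumed low-communication setup: int8 values (1 byte each), synchronized every INNER_STEPS.
low_comm = bytes_per_step(NUM_PARAMS, 1, INNER_STEPS)

reduction = baseline / low_comm
print(f"reduction factor: {reduction:.0f}x")
```

Under these assumptions the two savings multiply: 4x from the narrower data type and 100x from synchronizing less often.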

What problem does this paper attempt to address?

This paper addresses the infrastructure challenges of large-scale language model training in geographically distributed environments where bandwidth is limited and nodes join and leave dynamically. It focuses on three issues:

1. **Bandwidth Limitation**: Traditional distributed training relies on high-bandwidth data center interconnects (such as InfiniBand). In geographically distributed environments, the bandwidth between nodes may be as low as several hundred Mb/s to several Gb/s, far below what is typical in high-performance computing environments. This makes communication the main bottleneck in distributed training.
2. **Node Dynamic Changes**: Nodes may join or leave during training, which makes system reliability a key concern. In a community-driven setting, node behavior is even less predictable.
3. **Resource Utilization Efficiency**: Compute utilization must stay high, and training must remain stable and converge, despite limited bandwidth and a changing set of nodes.

To address these problems, the paper introduces INTELLECT-1, a 10-billion-parameter language model trained collaboratively on globally distributed nodes. The authors developed PRIME, a framework for efficient, fault-tolerant distributed training on unreliable, bandwidth-limited nodes. The key innovations of PRIME include:

- **ElasticDeviceMesh**: A new abstraction that manages fault-tolerant communication across the internet and efficient communication within a node.
- **DiLoCo Algorithm**: A distributed low-communication algorithm that reduces communication volume by synchronizing less frequently.
- **Int8 Quantization**: Further reduces the communication bandwidth requirement by quantizing the pseudo-gradients to 8-bit integers.
- **Hybrid FSDP and DiLoCo**: Combines fully sharded data parallelism (FSDP) with DiLoCo to optimize training efficiency both within a node and across nodes.
- **Fault-Tolerant Mechanisms**: Include dynamic node management and live checkpoint recovery to handle nodes joining and leaving.

With these techniques, the paper shows that large-scale language models can be trained efficiently in bandwidth-limited environments with a changing set of nodes, opening new possibilities for decentralized, community-driven AI model training.

### Summary

The core question of the paper is how to train large-scale language models efficiently in distributed environments with limited bandwidth and dynamically changing nodes. Using the PRIME framework, the authors trained INTELLECT-1 on a network of nodes spread across the globe, demonstrating the feasibility and potential of decentralized training.
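The DiLoCo and int8 quantization items above can be made concrete with a small sketch. It is a simplified illustration, not PRIME's implementation: every function name here is hypothetical, the quantizer is a plain symmetric max-abs scheme chosen for brevity, and the outer update uses plain SGD where DiLoCo uses an outer optimizer with Nesterov momentum. Each worker is assumed to have already trained locally for some number of steps starting from the shared global weights.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric uniform quantization to integer codes in [-127, 127] (assumed scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def pseudo_gradient(global_params: List[float], local_params: List[float]) -> List[float]:
    """Pseudo-gradient: global weights minus the weights after local training."""
    return [g - l for g, l in zip(global_params, local_params)]


def outer_step(
    global_params: List[float],
    worker_params: List[List[float]],
    outer_lr: float = 1.0,
) -> List[float]:
    """One synchronization round: average the quantized pseudo-gradients, then update.

    Plain SGD stands in for the outer optimizer to keep the sketch short.
    """
    num_workers = len(worker_params)
    averaged = [0.0] * len(global_params)
    for local in worker_params:
        # Only the int8 codes and one scale per worker would cross the network.
        codes, scale = quantize_int8(pseudo_gradient(global_params, local))
        for i, value in enumerate(dequantize_int8(codes, scale)):
            averaged[i] += value / num_workers
    return [g - outer_lr * d for g, d in zip(global_params, averaged)]


if __name__ == "__main__":
    shared = [1.0, -2.0]
    after_local_training = [[0.5, -1.0], [1.2, -2.5]]  # made-up worker weights
    print(outer_step(shared, after_local_training))
```

With `outer_lr=1.0` the update moves the global weights to approximately the average of the workers' weights, up to quantization error. The communication saving comes from two places: workers exchange data only once per round instead of every step, and each exchanged value occupies one byte.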

INTELLECT-1 Technical Report

Optimizing Distributed Training on Frontier for Large Language Models

PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing

Distributed Inference and Fine-tuning of Large Language Models Over The Internet

Pretraining Billion-scale Geospatial Foundational Models on Frontier

Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision

Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery

Apple Intelligence Foundation Language Models

Decentralized Training of Foundation Models in Heterogeneous Environments

Efficient Large-Scale Language Model Training on GPU Clusters

INTELLECT: Adapting Cyber Threat Detection to Heterogeneous Computing Environments

The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

DiPaCo: Distributed Path Composition

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

Accelerating Large Language Model Training with In-Package Optical Links for Scale-Out Systems

OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training

52B to 1T: Lessons Learned via Tele-FLM Series

Elixir: Train a Large Language Model on a Small GPU Cluster