Abstract: Training large-scale deep learning models requires parallelizing the model and data across many devices, and the choice of parallelism strategy depends substantially on training workload characteristics such as memory consumption, computation cost, and communication cost. Current approaches generally assume that workloads are uniform across the samples of a given task, so existing systems adopt a single static parallelism strategy for the whole training process. When training models with sequence inputs, however, this assumption fails because sequence length varies across samples, and a static parallelism strategy yields sub-optimal performance. In this paper, we first reveal the under-explored fact that the optimal parallelism strategy varies even among the sequences within a single mini-batch. Motivated by this, we present HotSPa, a novel system that adopts multiple parallelism strategies for efficient training with sequence inputs. Specifically, given a mini-batch of training sequences, HotSPa partitions them into multiple groups and applies a different parallelism strategy to each group. To enable hot switching between strategies, HotSPa transfers model parameters and accumulated gradients among the devices on the fly. We propose two techniques to make parallelism hot switching seamless and fast. First, we design a graph compiler that generates distributed computation graphs for different parallelism strategies simultaneously and orchestrates them to share a single model storage backbone. Second, we develop a simple yet effective hot switch planner that heuristically deduces communication plans to accelerate the transition of model partitioning between any pair of strategies. 
Extensive experiments on large language model training demonstrate that HotSPa can be up to 2.99× faster than Megatron-LM and DeepSpeed, which use static parallelism strategies. Source code is available at https://github.com/PKU-DAIR/Hetu.
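
The grouping step described in the abstract (partition a mini-batch into groups and process each group under its own strategy) can be illustrated with a minimal sketch. It assumes that groups are formed by sequence-length thresholds, which the abstract does not state; the boundaries, the strategy labels, and the function name `group_by_length` are all hypothetical and are not taken from HotSPa or Hetu.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical (max_length, strategy_label) pairs, sorted by max_length.
# Both the length boundaries and the labels are illustrative assumptions.
BUCKETS = [
    (4096, "strategy_short"),
    (16384, "strategy_medium"),
    (32768, "strategy_long"),
]


def group_by_length(seq_lengths, buckets=BUCKETS):
    """Assign each sequence index to the first bucket whose max_length fits it.

    Returns a dict mapping strategy label -> list of sequence indices.
    """
    bounds = [max_len for max_len, _ in buckets]
    groups = defaultdict(list)
    for idx, length in enumerate(seq_lengths):
        # bisect_left finds the first boundary that is >= length.
        pos = bisect_left(bounds, length)
        if pos == len(bounds):
            raise ValueError(f"sequence {idx} of length {length} exceeds all buckets")
        groups[buckets[pos][1]].append(idx)
    return dict(groups)
```

Under these assumptions, calling `group_by_length([512, 4096, 5000, 30000, 100])` places sequences 0, 1, and 4 in the short group, sequence 2 in the medium group, and sequence 3 in the long group. A real system would then run each group under its own strategy and switch model partitioning between groups, which this sketch does not attempt.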

Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning

Towards Efficient Scheduling of Federated Mobile Devices under Computational and Statistical Heterogeneity

US-Byte: an Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning

Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training

High-Speed Data Communication with Advanced Networks in Large Language Model Training

Federated Learning-Based Cooperative Model Training for Task-Oriented Semantic Communication

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

On Optimizing the Communication of Model Parallelism

Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression

Decentralized Training of Foundation Models in Heterogeneous Environments

How Useful is Communication Scheduling for Distributed Training?

ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution

A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters

Enabling Parallelism Hot Switching for Efficient Training of Large Language Models

Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping