Abstract: Training large-scale deep learning models requires parallelizing the model and data across many devices, and the choice of parallelism strategy depends heavily on the characteristics of the training workload, such as memory consumption, computation cost, and communication cost. Current approaches generally assume that the workload is uniform across the samples of a given task, so existing systems adopt a single static parallelism strategy for the entire training process. When training models with sequence inputs, however, this assumption fails because sequence length varies across samples, and training with a static parallelism strategy therefore yields sub-optimal performance. In this paper, we first reveal the under-explored fact that the optimal parallelism strategy varies even among the sequences within a single mini-batch. Motivated by this, we present HotSPa, a novel system that adopts multiple parallelism strategies for efficient training with sequence inputs. Specifically, given a mini-batch of training sequences, HotSPa partitions them into multiple groups and applies a different parallelism strategy to each group. To enable hot switching between strategies, HotSPa transfers model parameters and accumulated gradients among the devices on the fly. We propose two techniques to make parallelism hot switching seamless and fast. First, we design a graph compiler that generates distributed computation graphs for the different parallelism strategies simultaneously and orchestrates them to share a single model storage backbone. Second, we develop a simple yet effective hot switch planner that heuristically deduces communication plans to accelerate the transition of model partitioning between any pair of strategies. Extensive experiments on large language model training demonstrate that HotSPa can be up to 2.99× faster than Megatron-LM and DeepSpeed, which use static parallelism strategies. Source code is available: https://github.com/PKU-DAIR/Hetu.
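
To make the grouping step concrete, the following is a minimal sketch of how the sequences of one mini-batch might be partitioned by length, with each group assigned to its own parallelism strategy. It is an illustration only, not HotSPa's implementation: the `Strategy` class, the strategy names, the length thresholds, and `group_by_strategy` are all hypothetical, and the abstract does not state how groups are actually chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Hypothetical description of one parallelism strategy."""
    name: str          # illustrative label, e.g. data-parallel x tensor-parallel degrees
    max_seq_len: int   # assumed longest sequence this strategy is used for


# Assumed strategies, ordered from shortest to longest supported sequence.
# The thresholds are made up for illustration.
STRATEGIES = [
    Strategy("dp8_tp1", 4096),
    Strategy("dp2_tp4", 16384),
    Strategy("dp1_tp8", 32768),
]


def group_by_strategy(seq_lens, strategies=STRATEGIES):
    """Assign each sequence index to the first strategy whose limit fits it."""
    groups = {s.name: [] for s in strategies}
    for idx, length in enumerate(seq_lens):
        for s in strategies:
            if length <= s.max_seq_len:
                groups[s.name].append(idx)
                break
        else:
            raise ValueError(f"sequence {idx} (length {length}) fits no strategy")
    return groups


if __name__ == "__main__":
    # One mini-batch with mixed sequence lengths.
    print(group_by_strategy([512, 20000, 4096, 9000]))
```

In a system of the kind the abstract describes, each group would then be processed under its own strategy, with parameters and accumulated gradients transferred between devices when switching from one group's strategy to the next.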

Enabling Parallelism Hot Switching for Efficient Training of Large Language Models