Abstract: Training large-scale deep learning models requires parallelizing the model and data across many devices, and the choice of parallelism strategy depends heavily on workload characteristics such as memory consumption, computation cost, and communication cost. Current approaches generally assume that training workloads are uniform across the samples of a given task, so existing systems adopt a single static parallelism strategy for the whole training process. When training models with sequence inputs, however, this assumption fails because sequence length varies across samples, and training with a static parallelism strategy therefore yields sub-optimal performance. In this paper, we first reveal the under-explored fact that the optimal parallelism strategy varies even among the sequences within a single mini-batch. Motivated by this, we present HotSPa, a novel system that adopts multiple parallelism strategies for efficient training with sequence inputs. Specifically, given a mini-batch of training sequences, HotSPa partitions them into multiple groups and applies a different parallelism strategy to process each group individually. To enable hot switching between strategies, HotSPa transfers model parameters and accumulated gradients among the devices on the fly. We propose two techniques to make parallelism hot switching seamless and fast. First, we design a graph compiler that generates distributed computation graphs for the different parallelism strategies simultaneously and orchestrates them to share a single model storage backbone. Second, we develop a simple yet effective hot switch planner that heuristically deduces communication plans to accelerate the transition of model partitioning between any pair of strategies. Extensive experiments on large language model training demonstrate that HotSPa can be up to 2.99× faster than Megatron-LM and DeepSpeed, which use static parallelism strategies. Source code is available at https://github.com/PKU-DAIR/Hetu.
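
The grouping step described in the abstract can be illustrated with a minimal Python sketch. It is not HotSPa's implementation: the length thresholds, the strategy labels (`dp8`, `dp2_tp4`, `tp8`), and the function names are hypothetical, chosen only to show how the sequences of one mini-batch might be partitioned by length so that each group can be processed under its own parallelism strategy.

```python
from collections import defaultdict

# Hypothetical buckets: (maximum sequence length, strategy label).
# Thresholds and labels are illustrative assumptions, not values from the paper.
BUCKETS = [
    (4096, "dp8"),       # short sequences: favor data parallelism
    (16384, "dp2_tp4"),  # medium sequences: mix data and tensor parallelism
    (32768, "tp8"),      # long sequences: favor tensor parallelism
]


def pick_strategy(seq_len, buckets=BUCKETS):
    """Return the label of the first bucket whose length limit fits seq_len."""
    for max_len, strategy in buckets:
        if seq_len <= max_len:
            return strategy
    raise ValueError(f"sequence length {seq_len} exceeds the largest bucket")


def group_by_strategy(seq_lens, buckets=BUCKETS):
    """Partition the sample indices of one mini-batch by assigned strategy.

    Groups are returned in bucket order, so a training loop could process
    them one after another, switching strategy between groups and applying
    a single optimizer update once every group has contributed gradients.
    """
    groups = defaultdict(list)
    for idx, seq_len in enumerate(seq_lens):
        groups[pick_strategy(seq_len, buckets)].append(idx)
    return {s: groups[s] for _, s in buckets if s in groups}


if __name__ == "__main__":
    mini_batch_lengths = [512, 30000, 2048, 9000]
    for strategy, indices in group_by_strategy(mini_batch_lengths).items():
        print(strategy, indices)
```

In this sketch the number of groups equals the number of non-empty buckets, which bounds how many strategy switches one mini-batch would incur; the cost of each switch is what the hot switch planner described above is meant to reduce.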

Accelerating the Training of Large Language Models Using Efficient Activation Rematerialization and Optimal Hybrid Parallelism

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

An Efficient 2D Method for Training Super-Large Deep Learning Models

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Large Language Model Inference Acceleration Based on Hybrid Model Branch Prediction

Optimizing Large Model Training through Overlapped Activation Recomputation

Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency

Adaptive Optimization for Enhanced Efficiency in Large-Scale Language Model Training

Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator

TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading

Enabling Parallelism Hot Switching for Efficient Training of Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

Accelerating Large Language Model Training with Hybrid GPU-based Compression

Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Optimizing Distributed Training on Frontier for Large Language Models

An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

Improving Automatic Parallel Training Via Balanced Memory Workload Optimization

ProTrain: Efficient LLM Training via Memory-Aware Techniques

Efficient Large-Scale Language Model Training on GPU Clusters