Abstract: Training large-scale deep learning models requires parallelizing the model and data across many devices, and the choice of parallelism strategy depends heavily on workload characteristics such as memory consumption, computation cost, and communication cost. Current approaches generally assume that the training workload is uniform across the samples of a given task, so existing systems adopt a single static parallelism strategy for the whole training process. When training models with sequence inputs, however, this assumption fails because sequence length varies across samples, and training with a static parallelism strategy yields sub-optimal performance. In this paper, we first reveal the under-explored fact that the optimal parallelism strategy varies even among the sequences within a single mini-batch. Motivated by this, we present HotSPa, a novel system that adopts multiple parallelism strategies for efficient training with sequence inputs. Specifically, given a mini-batch of training sequences, HotSPa partitions them into multiple groups and applies a different parallelism strategy to each group. To enable hot switching between strategies, HotSPa transfers model parameters and accumulated gradients among the devices on the fly. We propose two techniques to make parallelism hot switching seamless and fast. First, we design a graph compiler that generates distributed computation graphs for the different parallelism strategies simultaneously and orchestrates them to share a single model storage backbone. Second, we develop a simple yet effective hot switch planner that heuristically derives communication plans to accelerate the transition of model partitioning between any pair of strategies. Extensive experiments on large language model training demonstrate that HotSPa can be up to 2.99× faster than Megatron-LM and DeepSpeed, which use static parallelism strategies. Source code is available: https://github.com/PKU-DAIR/Hetu.
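
The grouping step described in the abstract can be illustrated with a minimal sketch. This is not HotSPa's implementation or the Hetu API: the `Strategy` class, the `group_by_strategy` function, the strategy names, and the length thresholds are all hypothetical, and the rule of assigning each sequence to the strategy with the smallest sufficient length limit is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Strategy:
    """Hypothetical description of one parallelism strategy."""
    name: str
    max_seq_len: int  # assumed longest sequence this strategy can hold in memory


def group_by_strategy(
    seq_lens: Sequence[int], strategies: Sequence[Strategy]
) -> Dict[str, List[int]]:
    """Assign each sequence index to the strategy with the smallest
    sufficient length limit (an illustrative rule, not the paper's)."""
    ordered = sorted(strategies, key=lambda s: s.max_seq_len)
    groups: Dict[str, List[int]] = {s.name: [] for s in ordered}
    for idx, length in enumerate(seq_lens):
        for strategy in ordered:
            if length <= strategy.max_seq_len:
                groups[strategy.name].append(idx)
                break
        else:
            # No strategy can hold this sequence.
            raise ValueError(f"sequence {idx} of length {length} fits no strategy")
    return groups


# Hypothetical strategies: short sequences favour data parallelism,
# long sequences need more model sharding to fit in memory.
strategies = [
    Strategy("short_dp", 4096),
    Strategy("medium_mixed", 16384),
    Strategy("long_tp", 32768),
]
mini_batch_lens = [512, 30000, 2048, 9000]
groups = group_by_strategy(mini_batch_lens, strategies)

# Each non-empty group would be processed under its own strategy, with
# gradients accumulated across groups before a single optimizer step.
for name, indices in groups.items():
    if indices:
        print(f"{name}: sequences {indices}")
```

In this sketch the switch between groups is only implied by the loop; in the system described above, that transition is where model parameters and accumulated gradients are redistributed among devices.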

Simulation-Based Parallel Training

A Parallel Simulator for Massive Reservoir Models Utilizing Distributed-Memory Parallel Systems

Training Deep Surrogate Models with Large Scale Online Learning

Parallel Learning by Multitasking Neural Networks

Feasibility Study on Active Learning of Smart Surrogates for Scientific Simulations

Parallel Learning - A New Framework for Machine Learning

In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD

Dynamic Universal Approximation Theory: Foundations for Parallelism in Neural Networks

Coupled online learning as a way to tackle instabilities and biases in neural network parameterizations

Parallel Learning: a Perspective and a Framework

A probabilistic framework for learning non-intrusive corrections to long-time climate simulations from short-time training data

A Novel Parallel Framework for Pursuit Learning Schemes

On the Relationships between Graph Neural Networks for the Simulation of Physical Systems and Classical Numerical Methods

Predict globally, correct locally: Parallel-in-time optimization of neural networks

Neural-Parareal: Dynamically Training Neural Operators as Coarse Solvers for Time-Parallelisation of Fusion MHD Simulations

Parallel Machine Learning of Partial Differential Equations

Enabling Parallelism Hot Switching for Efficient Training of Large Language Models

Learning to Simulate High Energy Particle Collisions from Unlabeled Data

On Fast Simulation of Dynamical System with Neural Vector Enhanced Numerical Solver

MelissaDL x Breed: Towards Data-Efficient On-line Supervised Training of Multi-parametric Surrogates with Active Learning

Differentiable Multi-Fidelity Fusion: Efficient Learning of Physics Simulations with Neural Architecture Search and Transfer Learning