DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He

2024-01-15

Abstract: Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for expensive foundation model pretraining) is both less explored and difficult to realize, due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost (\$3.7K if rented on Azure), while still maintaining 95% of model quality compared to the baseline with full data and cost (\$46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost. DeepSpeed Data Efficiency is easy to use and tune, enabling us to easily apply it and verify its benefit on additional tasks including GPT-3 MoE model pretraining and small-scale GPT-2/ViT finetuning.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the high cost of training large-scale deep learning models. Model scale and data scale are both growing rapidly, and training cost is proportional to both, so the cost becomes enormous, especially when pre-training foundation models. While the rapid evolution of model architectures has received widespread attention, how to use training data efficiently has been relatively overlooked. Existing data efficiency techniques improve training efficiency to some extent, but they remain limited by complex implementation, limited scalability, and limited customizability.

To solve these issues, the authors propose the **DeepSpeed Data Efficiency** framework, which uses efficient data sampling and routing to make better use of data, improve training efficiency, and enhance model quality. Specifically, the framework combines two techniques:

1. **Efficient Curriculum Learning Library**: Analyzes and indexes large-scale datasets efficiently, and customizes data sampling strategies based on different difficulty metrics.
2. **Random Layerwise Token Dropping (random-LTD)**: Reduces computation by randomly dropping a portion of the input tokens, independently at each intermediate layer, while maintaining model performance.

With these techniques, DeepSpeed Data Efficiency can significantly reduce data consumption, training time, and cost while maintaining or improving model quality. For example, in GPT-3 1.3B language model pre-training, the framework uses 12.5x less data/time/cost while maintaining 95% of the baseline model quality. For GPT-3 1.3B and BERT-large pre-training, the framework can also reach the same model quality with up to 2x less data/time/cost, or achieve better model quality with the same data/time/cost.
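
The difficulty-based sampling described above can be illustrated with a minimal sketch. It is not the DeepSpeed library's API: the function names (`linear_pacing`, `build_difficulty_index`, `sample_batch`), the use of sequence length as the difficulty metric, and the linear pacing schedule are all assumptions chosen for illustration.

```python
import random


def linear_pacing(step, total_steps, min_difficulty, max_difficulty):
    """Hypothetical pacing function: threshold grows linearly from min to max."""
    if step >= total_steps:
        return max_difficulty
    frac = step / total_steps
    return min_difficulty + frac * (max_difficulty - min_difficulty)


def build_difficulty_index(samples, metric):
    """Analyze the dataset once: (difficulty, sample index) pairs, easiest first."""
    return sorted((metric(s), i) for i, s in enumerate(samples))


def sample_batch(index, threshold, batch_size, rng):
    """Draw a batch uniformly from samples whose difficulty is <= threshold."""
    eligible = [i for difficulty, i in index if difficulty <= threshold]
    return rng.sample(eligible, min(batch_size, len(eligible)))


# Toy corpus of token lists with lengths 1..20.
# Assumption: difficulty = sequence length (other metrics could be plugged in).
samples = [["tok"] * n for n in range(1, 21)]
index = build_difficulty_index(samples, metric=len)
rng = random.Random(0)

for step in (0, 50, 100):
    threshold = linear_pacing(step, total_steps=100,
                              min_difficulty=4, max_difficulty=20)
    batch = sample_batch(index, threshold, batch_size=4, rng=rng)
    print(step, threshold, sorted(len(samples[i]) for i in batch))
```

Early steps only draw from short (easy) sequences, and the eligible pool widens until the whole dataset is available. Building the index once up front is what keeps per-step sampling cheap in this sketch.
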
Additionally, the DeepSpeed Data Efficiency framework is easy to use and tune, requiring only minor modifications for users to apply it to different tasks, including GPT-3 MoE model pre-training and small-scale GPT-2/ViT fine-tuning. The framework has been open-sourced, providing the community with a useful tool to apply curriculum learning and random layerwise token dropping techniques.
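
As a rough illustration of the random layerwise token dropping idea summarized above, the following toy sketch routes a sequence through a stack of layers. It is a sketch under stated assumptions, not the framework's implementation: `random_ltd_forward` and `toy_layer` are hypothetical names, the "layers" are plain Python functions rather than transformer blocks, and keeping the first and last layers on the full sequence is an assumption based on the summary's reference to intermediate layers.

```python
import random


def random_ltd_forward(tokens, layers, keep, rng):
    """Toy forward pass where each middle layer sees only `keep` random tokens.

    Assumption: the first and last layers process the full sequence. Tokens
    skipped by a layer bypass it unchanged and are merged back at their
    original positions, so the sequence order is preserved.
    """
    hidden = list(tokens)
    last = len(layers) - 1
    for depth, layer in enumerate(layers):
        if depth == 0 or depth == last:
            kept = list(range(len(hidden)))
        else:
            # Fresh, independent random subset for every intermediate layer.
            kept = sorted(rng.sample(range(len(hidden)), keep))
        processed = layer([hidden[i] for i in kept])
        for pos, value in zip(kept, processed):
            hidden[pos] = value
    return hidden


def toy_layer(xs):
    """Stand-in for a transformer layer: adds 1 to every token it processes."""
    return [x + 1 for x in xs]


rng = random.Random(0)
out = random_ltd_forward([0] * 8, [toy_layer] * 4, keep=4, rng=rng)
# Each entry counts how many layers processed that token (between 2 and 4).
print(out)
```

With 8 tokens, 4 layers, and `keep=4`, the two middle layers each do half the per-token work, so the total drops from 32 to 24 token-layer operations. Because the subset is redrawn at every layer, different tokens are skipped at different depths rather than being dropped for the rest of the network.
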

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

A Multi-Level Framework for Accelerating Training Transformer Models

Data Shunt: Collaboration of Small and Large Models for Lower Costs and Better Performance

Improving Large Models with Small models: Lower Costs and Better Performance

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

On Efficient Training of Large-Scale Deep Learning Models: A Literature Review

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Little Giants: Synthesizing High-Quality Embedding Data at Scale

DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks

M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining

Does your data spark joy? Performance gains from domain upsampling at the end of training

Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines

An Efficient 2D Method for Training Super-Large Deep Learning Models

Accelerating Data Loading in Deep Neural Network Training

SmartDeal: Remodeling Deep Network Weights for Efficient Inference and Training

FASTERMOE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models