DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He
2024-01-15
Abstract: Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost (\$3.7K if rented on Azure), while still maintaining 95% of model quality compared to the baseline with full data and cost (\$46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost. DeepSpeed Data Efficiency is easy to use and tune, enabling us to easily apply it and verify its benefit on additional tasks including GPT-3 MoE model pretraining and small-scale GPT-2/ViT finetuning.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the high cost of training large-scale deep learning models. Model size and data scale are both growing rapidly, and training costs become enormous as a result, especially when pre-training foundation models. While the rapid evolution of model architectures has received widespread attention, the efficient use of training data has been comparatively overlooked. Existing data efficiency techniques improve training efficiency to some extent, but they remain complex to implement, hard to scale, and difficult to customize.

To address these issues, the authors propose the **DeepSpeed Data Efficiency** framework, which uses efficient data sampling and data routing to make better use of data, improve training efficiency, and enhance model quality. The framework combines two techniques:

1. **Efficient Curriculum Learning Library**: Analyzes and indexes large-scale datasets efficiently, and supports custom data sampling strategies based on different difficulty metrics.
2. **Random Layerwise Token Dropping (random-LTD)**: Reduces computation by randomly dropping a portion of the input tokens, independently at each intermediate layer, while maintaining model performance.

With these techniques, DeepSpeed Data Efficiency can significantly reduce data consumption, training time, and cost while maintaining or improving model quality. For example, when pre-training the GPT-3 1.3B language model, the framework uses 12.5x less data/time/cost while maintaining 95% of the baseline model quality. For GPT-3 1.3B and BERT-large pre-training, it can also reach the same model quality with up to 2x less data/time/cost, or better model quality under the same data/time/cost.
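To make the curriculum learning idea above more concrete, here is a minimal sketch of difficulty-based data sampling. It is not the DeepSpeed API: the function names (`build_difficulty_index`, `linear_pacing`, `sample_batch`), the token-count difficulty metric, and the linear pacing schedule are all assumptions chosen for illustration. The dataset is analyzed once to build an easiest-first index, and each training step then samples only from the portion of the index that the pacing function currently allows.

```python
import random


def build_difficulty_index(samples, metric):
    """Analyze the dataset once: return sample indices sorted easiest-first."""
    return sorted(range(len(samples)), key=lambda i: metric(samples[i]))


def linear_pacing(step, total_steps, start_frac=0.2):
    """Fraction of the easiest-first index that is eligible at `step` (hypothetical schedule)."""
    progress = min(step / total_steps, 1.0)
    return start_frac + (1.0 - start_frac) * progress


def sample_batch(index, step, total_steps, batch_size, rng):
    """Draw a batch uniformly from the currently eligible easiest samples."""
    frac = linear_pacing(step, total_steps)
    eligible = min(len(index), max(batch_size, int(frac * len(index))))
    return rng.sample(index[:eligible], min(batch_size, eligible))


# Toy dataset: 100 "documents" with 1 to 100 tokens each.
samples = ["tok " * n for n in range(1, 101)]

# Assumed difficulty metric: token count (longer means harder).
index = build_difficulty_index(samples, metric=lambda s: len(s.split()))

rng = random.Random(0)
early = sample_batch(index, step=0, total_steps=100, batch_size=8, rng=rng)
late = sample_batch(index, step=100, total_steps=100, batch_size=8, rng=rng)

early_lengths = [len(samples[i].split()) for i in early]
late_lengths = [len(samples[i].split()) for i in late]
```

In this sketch, batches drawn at step 0 contain only the shortest fifth of the documents, while batches drawn at the final step may come from the whole dataset. Swapping in a different `metric` changes the curriculum without touching the sampling code, which is the kind of customization the summary attributes to the library.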
Additionally, the DeepSpeed Data Efficiency framework is easy to use and tune, requiring only minor modifications for users to apply it to different tasks, including GPT-3 MoE model pre-training and small-scale GPT-2/ViT fine-tuning. The framework has been open-sourced, providing the community with a useful tool to apply curriculum learning and random layerwise token dropping techniques.
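The token-dropping technique can be illustrated with a similarly small sketch. This is a toy model under stated assumptions, not the DeepSpeed implementation: `random_ltd_forward` and `keep_ratio` are hypothetical names, the first and last layers are assumed to process every token, and dropped tokens are assumed to bypass a layer unchanged and keep their original positions. Each intermediate layer draws its own random subset, so the selections are independent across layers.

```python
import random


def random_ltd_forward(tokens, layers, keep_ratio, rng):
    """Run `tokens` through `layers`; intermediate layers see a random token subset."""
    hidden = list(tokens)
    last = len(layers) - 1
    for depth, layer in enumerate(layers):
        if depth in (0, last):
            # Assumption: the first and last layers process every token.
            hidden = [layer(h) for h in hidden]
            continue
        n_keep = max(1, int(keep_ratio * len(hidden)))
        # A fresh random subset is drawn for every intermediate layer.
        kept = set(rng.sample(range(len(hidden)), n_keep))
        # Dropped tokens skip the layer and stay in their original positions.
        hidden = [layer(h) if i in kept else h for i, h in enumerate(hidden)]
    return hidden


# Toy "layers" that add 1, so each output counts the layers a token passed through.
layers = [lambda h: h + 1 for _ in range(6)]
out = random_ltd_forward([0] * 16, layers, keep_ratio=0.5, rng=random.Random(0))

# Total layer applications: a proxy for the compute spent on this sequence.
total_layer_calls = sum(out)
```

With 16 tokens, 6 layers, and `keep_ratio=0.5`, the two outer layers process all 16 tokens and each of the four intermediate layers processes 8, so the sketch performs 64 layer applications instead of the 96 a full forward pass would need.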