Data Distillation: A Survey

Noveen Sachdeva, Julian McAuley
2023-09-26
Abstract: The popularity of deep learning has led to the curation of a vast number of massive and multifarious datasets. Despite having close-to-human performance on individual tasks, training parameter-hungry models on large datasets poses multi-faceted problems such as (a) high model-training time; (b) slow research iteration; and (c) poor eco-sustainability. As an alternative, data distillation approaches aim to synthesize terse data summaries, which can serve as effective drop-in replacements of the original dataset for scenarios like model training, inference, architecture search, etc. In this survey, we present a formal framework for data distillation, along with providing a detailed taxonomy of existing approaches. Additionally, we cover data distillation approaches for different data modalities, namely images, graphs, and user-item interactions (recommender systems), while also identifying current challenges and future research directions.
Machine Learning, Computer Vision and Pattern Recognition, Information Retrieval
What problem does this paper attempt to address?
The paper addresses the multi-faceted challenges of training deep-learning models on large-scale datasets:

1. **High model training time**: Processing a large amount of data makes training large models very costly in time.
2. **Slow research iteration**: Long training runs slow down experiments and hypothesis verification.
3. **Poor eco-sustainability**: Large-scale data processing and model training consume a large amount of energy and have an adverse impact on the environment.

To address these problems, the paper surveys data distillation methods. The goal of data distillation is to synthesize tiny, high-fidelity data summaries that can effectively replace the original dataset for tasks such as model training, inference, and architecture search. Specifically, data distillation aims to:

- **Improve training efficiency**: Training on a much smaller dataset significantly reduces training time and compute consumption.
- **Accelerate research iteration**: Faster training lets researchers test and verify different models and algorithms more quickly.
- **Improve ecological sustainability**: Using fewer computing resources reduces the carbon footprint.

The paper also presents a taxonomy of data distillation techniques, including meta-model matching, gradient matching, trajectory matching, and distribution matching. It discusses how these techniques apply to different data modalities (images, graphs, and user-item interactions), along with the challenges they face and future research directions.
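To make the idea of a synthesized data summary concrete, here is a minimal sketch in the spirit of distribution matching, one of the technique families named above. It is an illustration under stated assumptions, not the survey's formulation: the feature map is assumed to be a fixed random `tanh` projection, a single class is distilled, and the names `embed`, `mean_embedding`, `matching_loss`, and `distill_class` are hypothetical. The synthetic points are updated by gradient descent so that their mean embedding approaches the mean embedding of the real data.

```python
import math
import random


def embed(x, W):
    """Embed a point with a fixed random feature map: tanh(W x)."""
    return [math.tanh(sum(xj * wj for xj, wj in zip(x, row))) for row in W]


def mean_embedding(points, W):
    """Average the embeddings of a set of points, feature by feature."""
    feats = [embed(p, W) for p in points]
    return [sum(col) / len(feats) for col in zip(*feats)]


def matching_loss(synthetic, target, W):
    """Squared distance between the synthetic mean embedding and the target."""
    mu = mean_embedding(synthetic, W)
    return sum((a - b) ** 2 for a, b in zip(mu, target))


def distill_class(real, n_synthetic=2, n_features=16, steps=300, lr=0.01, seed=0):
    """Learn n_synthetic points whose mean embedding matches that of `real`.

    Assumption: a single fixed random projection stands in for the
    embedding network; all hyperparameters are illustrative.
    """
    rng = random.Random(seed)
    dim = len(real[0])
    # W[k] is the weight vector of random feature k.
    W = [[rng.gauss(0.0, 1.0) / math.sqrt(dim) for _ in range(dim)]
         for _ in range(n_features)]
    target = mean_embedding(real, W)

    # Initialise the summary from randomly chosen real samples.
    synthetic = [list(p) for p in rng.sample(real, n_synthetic)]
    initial_loss = matching_loss(synthetic, target, W)

    for _ in range(steps):
        feats = [embed(s, W) for s in synthetic]
        mu = [sum(col) / n_synthetic for col in zip(*feats)]
        diff = [a - b for a, b in zip(mu, target)]
        for s, h in zip(synthetic, feats):
            for j in range(dim):
                # d loss / d s_j = (2/m) * sum_k diff_k * (1 - h_k^2) * W[k][j]
                grad = (2.0 / n_synthetic) * sum(
                    diff[k] * (1.0 - h[k] ** 2) * W[k][j]
                    for k in range(n_features)
                )
                s[j] -= lr * grad

    return synthetic, initial_loss, matching_loss(synthetic, target, W)


if __name__ == "__main__":
    data_rng = random.Random(1)
    real_points = [[data_rng.gauss(1.0, 1.0) for _ in range(3)] for _ in range(50)]
    summary, before, after = distill_class(real_points)
    print(f"loss before: {before:.4f}, after: {after:.4f}")
```

In this toy setup, 50 real points are summarized by 2 learned points. A practical method would repeat the procedure per class and use a richer embedding; the sketch only shows that the summary is optimized rather than selected from the original data.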