Abstract: The popularity of deep learning has led to the curation of a vast number of massive and multifarious datasets. Despite having close-to-human performance on individual tasks, training parameter-hungry models on large datasets poses multi-faceted problems such as (a) high model-training time; (b) slow research iteration; and (c) poor eco-sustainability. As an alternative, data distillation approaches aim to synthesize terse data summaries, which can serve as effective drop-in replacements of the original dataset for scenarios like model training, inference, architecture search, etc. In this survey, we present a formal framework for data distillation, along with providing a detailed taxonomy of existing approaches. Additionally, we cover data distillation approaches for different data modalities, namely images, graphs, and user-item interactions (recommender systems), while also identifying current challenges and future research directions.

What problem does this paper attempt to address?

This paper addresses the multi-faceted challenges of training deep-learning models on large-scale datasets:

1. **High model-training time**: Processing a large amount of data makes training large models very costly in time.
2. **Slow research iteration**: Long training runs slow down experiments and hypothesis verification.
3. **Poor eco-sustainability**: Large-scale data processing and model training consume a large amount of energy and have an adverse impact on the environment.

As an alternative, the paper surveys data distillation. The goal of data distillation is to synthesize terse, high-fidelity data summaries that can serve as effective drop-in replacements for the original dataset in tasks such as model training, inference, and architecture search. Specifically, data distillation aims to:

- **Improve training efficiency**: Training on a much smaller dataset significantly reduces training time and the consumption of computing resources.
- **Accelerate research iteration**: A faster training process lets researchers test and verify different models and algorithms more quickly.
- **Improve eco-sustainability**: Using fewer computing resources reduces the carbon footprint.

The paper also presents a formal framework for data distillation and a detailed taxonomy of existing approaches, including methods based on meta-model matching, gradient matching, trajectory matching, and distribution matching. It discusses how these methods apply to different data modalities (images, graphs, and user-item interactions), as well as current challenges and future research directions.
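To make the idea of a data summary concrete, here is a minimal sketch in the spirit of distribution matching. It is not the formulation of the survey or of any published method: it learns a handful of synthetic points whose mean embedding matches that of the real data. The function name `distill_by_mean_matching`, the fixed random linear embedding `W`, and all hyperparameters are assumptions chosen for illustration only.

```python
import numpy as np

def distill_by_mean_matching(real, n_synthetic=5, steps=200, lr=0.5, seed=0):
    """Learn n_synthetic points whose mean embedding matches that of `real`.

    Hypothetical toy setup: the embedding is a fixed random linear map W,
    so the mean of the embeddings equals the embedding of the mean.
    """
    rng = np.random.default_rng(seed)
    n_features = real.shape[1]
    # Assumed embedding: 16 random linear features, scaled to unit variance.
    W = rng.normal(size=(16, n_features)) / np.sqrt(16)
    # Synthetic summary, initialised from noise and then optimised directly.
    synthetic = rng.normal(size=(n_synthetic, n_features))
    target_mean = real.mean(axis=0)

    losses = []
    for _ in range(steps):
        gap = target_mean - synthetic.mean(axis=0)
        embedded_gap = W @ gap
        # Loss: squared distance between the two mean embeddings.
        losses.append(float(embedded_gap @ embedded_gap))
        # Gradient of the loss w.r.t. each synthetic point (identical per row).
        grad = -(2.0 / n_synthetic) * (W.T @ embedded_gap)
        synthetic -= lr * grad  # plain gradient descent, broadcast over rows
    return synthetic, losses

# Usage: compress 1000 real points into a 5-point synthetic summary.
data_rng = np.random.default_rng(1)
real = data_rng.normal(loc=[3.0, -1.0], size=(1000, 2))
synthetic, losses = distill_by_mean_matching(real)
```

Matching only the mean under a linear embedding is far weaker than what practical methods do. It is used here because it shows the core idea, optimising the synthetic data itself rather than model weights, in a few lines.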

Data Distillation: A Survey

A Comprehensive Survey of Dataset Distillation

A Survey on Dataset Distillation: Approaches, Applications and Future Directions

Dataset Distillation: A Comprehensive Review

Data-to-Model Distillation: Data-Efficient Learning Framework

Knowledge Distillation: A Survey

On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm

What is Dataset Distillation Learning?

Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality

One Category One Prompt: Dataset Distillation using Diffusion Models

Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning

Dataset Distillation via Curriculum Data Synthesis in Large Data Era

Data-Efficient Generation for Dataset Distillation

Data Distillation: Towards Omni-Supervised Learning

Behaviour Distillation

Dataset Distillation in Medical Imaging: A Feasibility Study

Data-Free Adversarial Distillation

Curriculum Dataset Distillation

A Label is Worth a Thousand Images in Dataset Distillation

Importance-Aware Adaptive Dataset Distillation

Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation