DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation

Aru Maekawa, Satoshi Kosugi, Kotaro Funakoshi, Manabu Okumura
2024-03-30
Abstract: Dataset distillation aims to compress a training dataset by creating a small number of informative synthetic samples such that neural networks trained on them perform as well as those trained on the original training dataset. Current text dataset distillation methods create each synthetic sample as a sequence of word embeddings instead of a text to apply gradient-based optimization; however, such embedding-level distilled datasets cannot be used for training other models whose word embedding weights are different from the model used for distillation. To address this issue, we propose a novel text dataset distillation approach, called Distilling dataset into Language Model (DiLM), which trains a language model to generate informative synthetic training samples as text data, instead of directly optimizing synthetic samples. We evaluated DiLM on various text classification datasets and showed that distilled synthetic datasets from DiLM outperform those from current coreset selection methods. DiLM achieved remarkable generalization performance in training different types of models and in-context learning of large language models. Our code will be available at
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses several key problems in text dataset distillation:

1. **Limitations of existing methods**:
   - Current text dataset distillation methods usually create synthetic samples by optimizing sequences of word embeddings rather than by generating text. Such embedding-level samples cannot be used to train models whose word embedding weights differ from those of the model used for distillation, which limits their flexibility in practice.
   - The resulting word embedding sequences are unreadable to humans, so they are of little help in interpreting and analyzing the original training dataset.
2. **Challenges in optimizing discrete text**:
   - Because text is discrete, it is very difficult to optimize it directly with gradient-based methods. Existing methods sidestep this by optimizing continuous word embeddings, but the resulting samples cannot be used directly to train other models.
3. **Model-independent applications**:
   - The authors aim for a method that produces text-level synthetic datasets that can be used to train different types of models, not only a specific pre-trained model.

### Solutions

To overcome these problems, the paper proposes a new text dataset distillation method called "Distilling dataset into Language Model (DiLM)". Its main contributions are:

1. **Text-level synthetic datasets**:
   - DiLM trains a language model to generate synthetic samples as text rather than directly optimizing word embeddings. The generated datasets can therefore be used to train models with different word embedding weights, which improves model independence.
2. **Optimization method**:
   - To work around the discreteness of text, DiLM trains the language model by minimizing a gradient matching loss between generated samples and real samples.
   - A differentiable backpropagation path is designed so that this loss can be used to optimize the language model's parameters.
3. **Experimental verification**:
   - In experiments on multiple text classification datasets, the synthetic datasets generated by DiLM outperform those from current coreset selection methods when training the same model. They also perform well when training different types of models and in in-context learning of large language models (LLMs) under few-shot prompting.

### Conclusion

DiLM addresses the limitations of existing text dataset distillation methods by generating text-level synthetic datasets, which improves both the interpretability and the model independence of the distilled data. The experimental results show that DiLM performs well across multiple text classification datasets and generalizes across model types.
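The summary names a gradient matching loss between generated and real samples but does not spell it out. The sketch below is one minimal reading of that idea, not the authors' implementation. It assumes a toy logistic-regression learner over hand-written feature vectors, and it uses cosine distance as the gradient distance. Both are illustrative assumptions, and the function names and data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient(weights, batch):
    """Gradient of the mean binary cross-entropy loss w.r.t. the weights
    of a logistic-regression learner (toy stand-in for the learner model).
    `batch` is a list of (features, label) pairs."""
    grad = [0.0] * len(weights)
    for features, label in batch:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for i, x in enumerate(features):
            grad[i] += (pred - label) * x / len(batch)
    return grad


def gradient_matching_loss(weights, real_batch, synthetic_batch):
    """1 - cosine similarity between the learner's gradients on the real
    and synthetic batches. Cosine distance is an assumed choice here."""
    g_real = loss_gradient(weights, real_batch)
    g_syn = loss_gradient(weights, synthetic_batch)
    dot = sum(a * b for a, b in zip(g_real, g_syn))
    norm = math.sqrt(sum(a * a for a in g_real)) * math.sqrt(sum(b * b for b in g_syn))
    return 1.0 - dot / (norm + 1e-12)  # epsilon guards against zero gradients


# Hypothetical toy data: 3 features per sample, binary labels.
weights = [0.1, -0.2, 0.05]
real_batch = [
    ([1.0, 0.0, 1.0], 1),
    ([0.0, 1.0, 1.0], 0),
    ([1.0, 1.0, 0.0], 1),
    ([0.0, 0.0, 1.0], 0),
]
# Synthetic samples whose labels agree with the pattern in the real data ...
aligned_synthetic = [([1.0, 0.0, 0.5], 1), ([0.0, 1.0, 0.5], 0)]
# ... and the same samples with their labels flipped.
flipped_synthetic = [([1.0, 0.0, 0.5], 0), ([0.0, 1.0, 0.5], 1)]

loss_aligned = gradient_matching_loss(weights, real_batch, aligned_synthetic)
loss_flipped = gradient_matching_loss(weights, real_batch, flipped_synthetic)
print(loss_aligned < loss_flipped)
```

Synthetic samples that push the learner's parameters in the same direction as the real data yield a lower loss than samples that push the opposite way. In DiLM, per the summary, this kind of signal is used to train the generator language model rather than to update the samples themselves. How the loss is made differentiable with respect to the generator is not shown here.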