Vision-Language Dataset Distillation

Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky

2024-08-20

Abstract: Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, preserving sufficient information to quickly train a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily vision-language datasets. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach with three adapted vision-language coreset selection methods. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmarks: for example, on Flickr30K, the best coreset selection method selecting 1000 image-text pairs for training achieves only 5.6% image-to-text retrieval accuracy (i.e., recall@1); in contrast, our dataset distillation almost doubles that to 9.9% with just 100 training pairs, an order of magnitude fewer.
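To make the "contrastive formulation" over image-text pairs concrete, here is a minimal sketch of a generic symmetric (image-to-text plus text-to-image) InfoNCE-style loss. It is an illustration under stated assumptions, not the paper's exact objective: the function names, the temperature of 0.07, and the use of plain Python lists for embeddings are all hypothetical choices made for this example.

```python
import math


def _mean_diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def _normalize(vec):
    """Scale a vector to unit length (assumes the vector is nonzero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_emb[i], text_emb[i]).

    The temperature value is an assumption for illustration only.
    """
    imgs = [_normalize(v) for v in image_emb]
    txts = [_normalize(v) for v in text_emb]
    # Cosine similarity between every image and every text, scaled.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (
        _mean_diag_cross_entropy(logits) + _mean_diag_cross_entropy(logits_t)
    )


# Toy check: matched pairs should score a much lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned = contrastive_loss(images, matched_texts)
mismatched = contrastive_loss(images, swapped_texts)
```

Because the loss is defined over pairs rather than class labels, it needs no discrete class set, which is the property the abstract relies on.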

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses dataset distillation for vision-language data. Existing dataset distillation methods focus on image classification datasets, whereas modern large-scale datasets are primarily vision-language datasets. These datasets pose three difficulties. First, they have no set of discrete classes around which to organize the distillation. Second, they contain complex cross-modal connections and redundancies, so images and text must be distilled jointly to capture their interdependencies. Third, the complexity of vision-language models (VLMs) makes trajectory matching computationally expensive. To address these, the paper proposes the first vision-language dataset distillation method: it jointly distills image-text pairs in a contrastive formulation and uses Low-Rank Adaptation (LoRA) matching to make trajectory matching more efficient in complex models. The result is a small synthetic dataset that retains the key information of the original and remains effective on high-resolution images and complex models. Because no prior baselines exist, the method is compared against three adapted coreset selection methods on the Flickr30K and COCO retrieval benchmarks. For example, on Flickr30K, the best coreset selection method reaches only 5.6% image-to-text recall@1 when selecting 1000 image-text pairs for training, while the proposed method almost doubles this to 9.9% with only 100 pairs.
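The retrieval accuracy quoted above is recall@1, as the abstract states. The sketch below shows how that metric can be computed from an image-by-text similarity matrix. It makes a simplifying assumption that is not from the paper: each image has exactly one matching caption, stored at the same index. The function name `recall_at_k` and the toy similarity values are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching item (same index) is in the top k.

    similarity[i][j] is the score between query i (e.g. an image) and
    candidate j (e.g. a caption). Assumes one correct candidate per query.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy similarity matrix: 3 images (rows) against 3 captions (columns).
sim = [
    [0.9, 0.1, 0.3],  # image 0 ranks caption 0 first (correct)
    [0.2, 0.4, 0.8],  # image 1 ranks caption 2 first (incorrect)
    [0.1, 0.2, 0.7],  # image 2 ranks caption 2 first (correct)
]
r1 = recall_at_k(sim, k=1)  # 2 of 3 images retrieve their own caption
r2 = recall_at_k(sim, k=2)  # caption 1 is second for image 1, so all hit
```

Benchmarks with several reference captions per image would count a hit when any correct caption appears in the top k; the single-caption version here keeps the example short.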
