Vision-Language Dataset Distillation

Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky
2024-08-20
Abstract: Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, preserving sufficient information to quickly train a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily vision-language datasets. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach with three adapted vision-language coreset selection methods. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmarks: for example, on Flickr30K, the best coreset selection method selecting 1000 image-text pairs for training achieves only 5.6% image-to-text retrieval accuracy (i.e., recall@1); in contrast, our dataset distillation almost doubles that to 9.9% with just 100 training pairs, an order of magnitude fewer.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses vision-language dataset distillation. Existing dataset distillation methods focus on image classification datasets, whereas modern large-scale datasets are primarily vision-language datasets. Such datasets pose three difficulties. First, they have no set of discrete classes around which to organize the distillation. Second, they contain complex cross-modal connections and redundancies, so images and text must be distilled jointly to capture their interdependencies. Third, the complexity of vision-language models (VLMs) makes trajectory matching computationally expensive. To handle these, the paper proposes a method that jointly distills image-text pairs in a contrastive formulation and uses Low-Rank Adaptation (LoRA) matching to make trajectory matching more efficient. The method compresses a large-scale dataset into a much smaller synthetic one while retaining key information and remaining effective on high-resolution images and complex models. Because no distillation baselines exist for this setting, the comparison is against three adapted coreset selection methods. On the Flickr30K and COCO retrieval benchmarks, the method significantly improves image-text retrieval performance. For example, on Flickr30K, the best coreset selection method trained on 1000 selected image-text pairs reaches only 5.6% image-to-text retrieval accuracy (recall@1), while the proposed method almost doubles this to 9.9% with only 100 distilled pairs.
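To make the two quantities mentioned above concrete, the following is a minimal NumPy sketch of a contrastive objective over image-text pairs and of image-to-text recall@1. It is an illustration under stated assumptions, not the authors' implementation: the symmetric InfoNCE-style form of the loss, the temperature value of 0.07, the function names, and the simplification of exactly one matching caption per image (with the match for row i at column i) are all assumptions introduced here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def similarity_matrix(image_emb, text_emb):
    """Cosine similarity between every image (rows) and every text (columns)."""
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, hypothetical temperature).

    Pair i is the positive for row i; all other entries in the batch act as
    negatives, so no discrete class labels are needed.
    """
    logits = similarity_matrix(image_emb, text_emb) / temperature

    def diagonal_cross_entropy(l):
        # Numerically stable log-softmax over each row, scored on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def image_to_text_recall_at_1(image_emb, text_emb):
    """Fraction of images whose top-ranked text is the paired one.

    Assumes one caption per image, aligned by index.
    """
    sims = similarity_matrix(image_emb, text_emb)
    top1 = sims.argmax(axis=1)
    return float(np.mean(top1 == np.arange(sims.shape[0])))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    texts = images + 0.01 * rng.normal(size=(8, 16))  # nearly aligned toy pairs
    print("loss:", contrastive_loss(images, texts))
    print("recall@1:", image_to_text_recall_at_1(images, texts))
```

The retrieval benchmarks named above pair each image with several captions, so an evaluation on them would count a hit when any of an image's captions is ranked first; the one-caption simplification here only keeps the sketch short.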