ITMix: Image-Text Mix Augmentation for Transferring CLIP to Image Classification

Tao Hong, Xiangyang Guo, Jinwen Ma
DOI: https://doi.org/10.1109/icsp56322.2022.9965292
2022-01-01
Abstract: The success of cross-modal models such as CLIP has recently sparked researchers' interest in better understanding the interaction between different modalities. Inspired by CLIP's zero-shot image classification experiments, this paper focuses on data augmentation when fine-tuning CLIP for downstream classification tasks. Mix-style methods such as Mixup and CutMix are effective data augmentations that generate new images by interpolating between different samples. Unlike these methods, which augment only the image modality, we mix the image and text modalities simultaneously, an approach we name ITMix. This creates more abundant matched image-text pairs. To implement ITMix, we propose effective fine-tuning with a match loss and a soft one-to-more mapping. The experimental results verify the superior accuracy of our proposed method on several image classification benchmarks, including CIFAR-10, CIFAR-100, and Food-101, among others.
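The abstract does not give the mixing formulas, so the following is only a rough sketch of what mixing both modalities could look like. It assumes Mixup-style linear interpolation with a single shared coefficient applied to the images, to precomputed text embeddings, and to one-hot labels. The function name `mix_image_text_pairs`, its parameters, and the choice to mix at the embedding level are all hypothetical and are not taken from the paper; the match loss and the soft one-to-more mapping are not reproduced here.

```python
import numpy as np


def mix_image_text_pairs(images, text_embeds, labels, num_classes, alpha=1.0, rng=None):
    """Mixup-style interpolation applied to images and text embeddings together.

    Assumption (not from the paper): one coefficient lam ~ Beta(alpha, alpha)
    is shared by both modalities and by the soft labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))      # shared mixing coefficient in [0, 1]
    perm = rng.permutation(len(images))      # partner sample for each item in the batch

    # Interpolate each sample with its partner, in both modalities.
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_text = lam * text_embeds + (1.0 - lam) * text_embeds[perm]

    # Soft targets: each mixed sample is matched to (up to) two classes.
    one_hot = np.eye(num_classes)[labels]
    soft_targets = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_images, mixed_text, soft_targets, lam
```

Under these assumptions, each mixed image is paired with a correspondingly mixed text representation, and the soft target row records how strongly the pair matches each of its source classes.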