Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models

Jingru Yi, Burak Uzkent, Oana Ignat, Zili Li, Amanmeet Garg, Xiang Yu, Linda Liu
2023-11-05
Abstract: Grounding-based vision and language models have been successfully applied to low-level vision tasks, aiming to precisely locate objects referred to in captions. The effectiveness of grounding representation learning relies heavily on the scale of the training dataset. Although data augmentation is a useful data enrichment strategy, it has received minimal attention in existing vision and language tasks because augmenting image-caption pairs is non-trivial. In this study, we propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations. Specifically, we apply text-conditioned color jittering and horizontal flipping to ensure semantic consistency between images and captions. To guarantee image-caption correspondence in the training samples, we modify the captions according to pre-defined keywords when applying horizontal flipping. Additionally, inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation. While we demonstrate our data augmentation method with the MDETR framework, the proposed approach is applicable to common grounding-based vision and language tasks with other frameworks. Finally, we show that an image encoder pretrained on large-scale image and language datasets (such as CLIP) can further improve the results. Through extensive experiments on three commonly used datasets (Flickr30k, referring expressions, and GQA), our method improves over state-of-the-art methods on various metrics. Code can be found at https://github.com/amzn/augment-the-pairs-wacv2024.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses data augmentation for vision and language models (particularly grounding-based models) on phrase grounding tasks. Specifically:
- **Importance of Data Augmentation**: The study highlights the role of data augmentation in vision and language tasks, especially when training data is limited. Although data augmentation is widely used in tasks such as object detection, there is relatively little research on augmenting image-caption pairs.
- **Problems with Existing Methods**: Existing augmentations (such as color jittering and horizontal flipping), when applied directly to image-caption pairs, can break the correspondence between the image and the caption, which degrades model performance. For example, flipping an image horizontally invalidates a caption that says an object is on the left.
- **Proposed Method**: The authors propose text-conditioned data augmentations, namely text-conditioned color jittering and horizontal flipping, and introduce pixel-level and block-level masking as text-unconditioned augmentations. These methods keep the image and its caption consistent during augmentation, thereby improving the model's generalization ability.

Experimental results on multiple benchmark datasets show that this method outperforms existing baselines on phrase grounding tasks. Additionally, combining it with a large-scale pretrained image encoder (such as CLIP) further improves the model's performance.
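The idea behind the text-conditioned augmentations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword lists (`SWAP`, `COLOR_WORDS`), the function names, and the rule of skipping color jittering when a caption mentions a color are all assumptions made for illustration, and the image is a plain nested list rather than a tensor. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` in pixel coordinates.

```python
import re

# Hypothetical keyword lists; the paper's pre-defined keywords may differ.
SWAP = {"left": "right", "right": "left"}
COLOR_WORDS = {"red", "green", "blue", "yellow", "black", "white",
               "orange", "purple", "pink", "brown", "gray", "grey"}

_SWAP_RE = re.compile(r"\b(left|right)\b", re.IGNORECASE)


def swap_direction_words(caption):
    """Swap 'left' and 'right' in one pass so that neither overwrites the other."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped
    return _SWAP_RE.sub(repl, caption)


def flip_pair(image, boxes, caption):
    """Horizontally flip an image, mirror its boxes, and rewrite the caption.

    image: list of rows, each row a list of pixel values.
    boxes: list of (x_min, y_min, x_max, y_max) tuples in pixels.
    """
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # Mirroring swaps the roles of x_min and x_max.
    flipped_boxes = [(width - x_max, y_min, width - x_min, y_max)
                     for (x_min, y_min, x_max, y_max) in boxes]
    return flipped_image, flipped_boxes, swap_direction_words(caption)


def can_color_jitter(caption):
    """Assumed rule: allow color jittering only if the caption names no color."""
    words = re.findall(r"[a-z]+", caption.lower())
    return not any(word in COLOR_WORDS for word in words)


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8]]
    boxes = [(0, 0, 1, 2)]
    print(flip_pair(image, boxes, "A dog to the left of a man"))
    print(can_color_jitter("a red car"))
```

A plain keyword swap also rewrites non-spatial uses of the words (for example "right now"), so a real pipeline would likely need a more careful keyword policy than this sketch.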