Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models

Jingru Yi, Burak Uzkent, Oana Ignat, Zili Li, Amanmeet Garg, Xiang Yu, Linda Liu

2023-11-05

Abstract: Grounding-based vision and language models have been successfully applied to low-level vision tasks, aiming to precisely locate objects referred to in captions. The effectiveness of grounding representation learning relies heavily on the scale of the training dataset. Despite being a useful data enrichment strategy, data augmentation has received minimal attention in existing vision and language tasks because augmenting image-caption pairs is non-trivial. In this study, we propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations. Specifically, we apply text-conditioned color jittering and horizontal flipping to ensure semantic consistency between images and captions. To guarantee image-caption correspondence in the training samples, we modify the captions according to pre-defined keywords when applying horizontal flipping. Additionally, inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation. While we demonstrate our data augmentation method with the MDETR framework, the proposed approach is applicable to common grounding-based vision and language tasks with other frameworks. Finally, we show that an image encoder pretrained on large-scale image and language datasets (such as CLIP) can further improve the results. Through extensive experiments on three commonly used datasets (Flickr30k, referring expressions, and GQA), our method demonstrates improved performance over the state of the art on various metrics. Code can be found at https://github.com/amzn/augment-the-pairs-wacv2024.
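The caption-aware horizontal flip described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword table `SWAP`, the function names, and the list-of-rows image representation are all assumptions made for this example, and the paper's actual pre-defined keyword list is not reproduced here.

```python
import re

# Hypothetical keyword table (assumption): the paper's real keyword list may differ.
SWAP = {"left": "right", "right": "left"}
_DIRECTION = re.compile(r"\b(left|right)\b", re.IGNORECASE)


def flip_caption(caption):
    """Swap direction keywords in one pass so a word is never swapped twice."""
    def swap(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped
    # Caveat: this also rewrites "right" used in the sense of "correct".
    return _DIRECTION.sub(swap, caption)


def flip_boxes(boxes, image_width):
    """Mirror (x_min, y_min, x_max, y_max) pixel boxes around the vertical axis."""
    return [
        (image_width - x_max, y_min, image_width - x_min, y_max)
        for (x_min, y_min, x_max, y_max) in boxes
    ]


def flip_image(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def flip_pair(image, caption, boxes):
    """Flip image, caption, and grounding boxes together to keep them consistent."""
    width = len(image[0])
    return flip_image(image), flip_caption(caption), flip_boxes(boxes, width)
```

For instance, `flip_pair` applied to an image captioned "The dog on the left of the man" would return the mirrored image together with the caption "The dog on the right of the man" and mirrored boxes, so the phrase still points at the same object.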

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses data augmentation for grounding-based vision and language models, specifically for the phrase grounding task. In summary:

- **Importance of data augmentation**: The study highlights the role of data augmentation in vision and language tasks, especially when training data is limited. Although data augmentation is widely used in tasks such as object detection, there is relatively little research on augmenting image-text pairs.
- **Problems with existing methods**: Standard augmentations (such as color jittering and horizontal flipping), when applied directly to image-text pairs, can break the correspondence between the image and the text and degrade model performance. For example, flipping an image makes a caption that says "left" incorrect.
- **Proposed method**: The authors propose text-conditioned data augmentation, including text-conditioned color jittering and horizontal flipping, and introduce pixel-level and block-level masking as text-unconditioned augmentations. These methods keep the image and its caption consistent during augmentation, thereby improving the model's generalization ability.

Experimental results on multiple benchmark datasets show that the method outperforms existing baselines on phrase grounding. Additionally, combining it with a large-scale pre-trained image encoder (such as CLIP) further improves performance.
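The color jittering and pixel-level masking ideas summarized above can be sketched as follows. This is one plausible reading, not the paper's code: it assumes that color jitter is simply skipped when the caption mentions a color, and the vocabulary `COLOR_WORDS`, the function names, and the `mask_ratio` parameter are hypothetical choices made for this example.

```python
import random
import re

# Hypothetical color vocabulary (assumption): the paper's keyword list may differ.
COLOR_WORDS = {
    "red", "orange", "yellow", "green", "blue", "purple",
    "pink", "brown", "black", "white", "gray", "grey",
}


def mentions_color(caption):
    """Return True if any whole word in the caption is a known color word."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    return any(token in COLOR_WORDS for token in tokens)


def text_conditioned_jitter(image, caption, jitter_fn):
    """Apply jitter_fn only when the caption does not depend on color.

    Assumption: skipping the jitter is one way to keep a caption such as
    "the red car" consistent with the augmented image.
    """
    if mentions_color(caption):
        return image
    return jitter_fn(image)


def pixel_mask(image, mask_ratio, rng, fill=0):
    """Mask each pixel independently with probability mask_ratio.

    The image is a list of rows; the caption needs no change, which is why
    masking counts as a text-unconditioned augmentation.
    """
    return [
        [fill if rng.random() < mask_ratio else pixel for pixel in row]
        for row in image
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[10, 20, 30], [40, 50, 60]]
    darker = lambda img: [[p // 2 for p in row] for row in img]
    print(text_conditioned_jitter(image, "a man in a red shirt", darker))
    print(pixel_mask(image, mask_ratio=0.3, rng=rng))
```

Passing an explicit `random.Random` instance keeps the masking reproducible, which makes it easier to compare runs with and without the augmentation.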

Cap2Aug: Caption guided Image to Image data Augmentation

Multimodal Data Augmentation for Image Captioning using Diffusion Models

Improving Multimodal Datasets with Image Captioning

MixGen: A New Multi-Modal Data Augmentation

Utilizing Text-based Augmentation to Enhance Video Captioning

Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

CapsFusion: Rethinking Image-Text Data at Scale

PairAug: What Can Augmented Image-Text Pairs Do for Radiology?

Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness

Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment

M3ixup: A Multi-Modal Data Augmentation Approach for Image Captioning

FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation

Tied-Augment: Controlling Representation Similarity Improves Data Augmentation

Visual Cluster Grounding for Image Captioning

Image Captioning with Multi-Context Synthetic Data

Benchmarking and Improving Detail Image Caption

Multimodality-guided Visual-Caption Semantic Enhancement