Abstract: Artificial neural networks typically struggle to generalize to out-of-context examples. One reason for this limitation is that datasets incorporate only partial information about the potential correlational structure of the world. In this work, we propose TIDA (Targeted Image-editing Data Augmentation), a targeted data augmentation method focused on improving models' human-like abilities (e.g., gender recognition) by filling the correlational structure gap using a text-to-image generative model. More specifically, TIDA identifies specific skills in captions describing images (e.g., the presence of a specific gender in the image), changes the caption (e.g., "woman" to "man"), and then uses a text-to-image model to edit the image to match the new caption (e.g., changing only the woman to a man while keeping the context identical). Based on the Flickr30K benchmark, we show that, compared with the original dataset, a TIDA-enhanced dataset related to gender, color, and counting abilities yields better performance on several image captioning metrics. Furthermore, in addition to relying on the classical BLEU metric, we conduct a fine-grained analysis of the improvements of our models over the baseline in different ways. We compare text-to-image generative models and find different behaviors of the image captioning models in terms of visual encoding and textual decoding.
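
The caption-editing step described in the abstract ("woman" to "man") can be illustrated with a minimal Python sketch. The swap table `GENDER_SWAPS` and the function `swap_gender` are hypothetical names introduced here for illustration; the word lists and text-editing procedure actually used by TIDA are not given in this text.

```python
import re

# Hypothetical swap table (an assumption, not the paper's actual word list).
GENDER_SWAPS = {
    "woman": "man", "man": "woman",
    "women": "men", "men": "women",
    "girl": "boy", "boy": "girl",
}

# Match any swap key as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, GENDER_SWAPS)) + r")\b",
    re.IGNORECASE,
)


def swap_gender(caption: str) -> str:
    """Return the caption with each gendered word replaced by its counterpart."""
    def _replace(match: re.Match) -> str:
        word = match.group(0)
        swapped = GENDER_SWAPS[word.lower()]
        # Preserve a leading capital letter if the original word had one.
        return swapped.capitalize() if word[0].isupper() else swapped

    return _PATTERN.sub(_replace, caption)


edited = swap_gender("A woman rides a bike.")
```

Only the gendered word changes, so the rest of the caption (the context) stays identical, which is the property the edited image is also expected to preserve.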

What problem does this paper attempt to address?

The paper addresses the limited ability of artificial neural networks to generalize to samples that fall outside the context of their training data. The authors point out that existing datasets usually capture only part of the correlational structure of the world, so models perform poorly on situations they have not seen. To address this, the paper proposes TIDA (Targeted Image-editing Data Augmentation). TIDA uses a text-to-image generative model to edit images so that they fill the gaps in the dataset's correlational structure, which improves image captioning models on human-like skills such as gender recognition, color recognition, and counting.

### Main contributions of the paper

1. **Identifying data for specific human-like skills**: TIDA identifies the data related to a specific human-like skill, such as color recognition or emotion recognition.
2. **Data augmentation based on a text-to-image generation model**: TIDA changes specific attributes (such as gender, color, or quantity) in image captions and generates new images to match, creating a dataset that can selectively improve specific human-like skills.

### Method overview

1. **Skill-related retrieval**:
   - Define a set of skills \( S=\{S_{i}, i = 1,\ldots,|S|\} \).
   - For each skill \( S_{i} \), create a binary classifier \( f_{S_{i}} \) that detects whether the skill is present in an image and its caption.
   - Apply this classifier to the dataset \( D \) and extract the sub-dataset \( D_{S_{i}} \) of examples that involve the skill.
2. **Targeted data augmentation**:
   - For each skill-related caption \( c_{k} \), use a text generation function \( G_{t,S_{i}} \) to generate a new caption \( c_{kli} \).
   - Use a text-to-image generation model \( G_{V} \) to generate a new image \( I_{kli} \) from the new caption.
   - Add the newly generated image-caption pairs to the dataset to form an augmented dataset \( D_{\text{train}}^{\text{GV}-S_{i}} \).

### Experimental setup

- **Dataset**: Flickr30K, which contains about 31,000 images and 159,000 captions.
- **Skills**: Three basic human skills: gender recognition, counting ability, and color recognition.
- **Baseline**: Comparison with a data augmentation baseline that generates images randomly.

### Result analysis

- **Classical metrics**: The models are evaluated with classical metrics such as BLEU and RefCLIPScore. The results show that models trained on the TIDA-augmented dataset perform better on multiple test sets.
- **Skill-related words**: Performance on each skill is evaluated by analyzing how the captions generated by the model use the words specific to that skill. The results indicate that TIDA is effective at improving performance on the targeted skills.

### Conclusion

Through targeted data augmentation, TIDA improves the performance of image captioning models on specific human-like skills, in particular gender recognition, color recognition, and counting. This offers a new approach to improving the generalization ability and robustness of such models.
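
The two steps of the method overview (skill-related retrieval, then targeted data augmentation) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Pair`, `COLOR_WORDS`, `has_color_skill`, `retrieve`, and `augment` are hypothetical names, the skill classifier is a simple keyword match on the caption only, and the text-to-image model is passed in as a caller-supplied function so that no real generative model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    image: str    # stand-in for image data, e.g. a file path (assumption)
    caption: str


# Hypothetical keyword list for the color skill (not taken from the paper).
COLOR_WORDS = {"red", "blue", "green", "yellow", "black", "white"}


def has_color_skill(pair: Pair) -> bool:
    """Toy binary classifier f_{S_i}: true if the caption mentions a color."""
    tokens = {t.strip(".,!?").lower() for t in pair.caption.split()}
    return bool(tokens & COLOR_WORDS)


def retrieve(dataset: List[Pair], classifier: Callable[[Pair], bool]) -> List[Pair]:
    """Extract the sub-dataset D_{S_i} of pairs that involve the skill."""
    return [pair for pair in dataset if classifier(pair)]


def augment(
    dataset: List[Pair],
    classifier: Callable[[Pair], bool],
    edit_caption: Callable[[str], str],          # plays the role of G_{t,S_i}
    generate_image: Callable[[str, str], str],   # plays the role of G_V
) -> List[Pair]:
    """Return the original dataset plus one generated pair per skill-related pair."""
    new_pairs = []
    for pair in retrieve(dataset, classifier):
        new_caption = edit_caption(pair.caption)
        new_image = generate_image(pair.image, new_caption)
        new_pairs.append(Pair(new_image, new_caption))
    return list(dataset) + new_pairs


# Usage with placeholder functions standing in for the real generators.
def red_to_blue(caption: str) -> str:
    return caption.replace("red", "blue")


def fake_generator(image: str, caption: str) -> str:
    return f"edited({image})"


data = [Pair("img1.jpg", "A red car on the road."), Pair("img2.jpg", "A dog runs.")]
augmented = augment(data, has_color_skill, red_to_blue, fake_generator)
```

Passing the caption editor and the image generator as functions keeps the retrieval and augmentation logic independent of any particular text-to-image model, which also makes it straightforward to swap in a random-generation baseline for comparison.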

Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness

User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

TTIDA: Controllable Generative Data Augmentation via Text-to-Text and Text-to-Image Models

Cap2Aug: Caption guided Image to Image data Augmentation

Explicit Image Caption Editing

Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models

Multimodal Data Augmentation for Image Captioning using Diffusion Models

Not Just Pretty Pictures: Toward Interventional Data Augmentation Using Text-to-Image Generators

Advanced Generative Deep Learning Techniques for Accurate Captioning of Images

Enhancing Image Captioning Using Deep Convolutional Generative Adversarial Networks

Improving Multimodal Datasets with Image Captioning

Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation

Semantic-aware Data Augmentation for Text-to-image Synthesis

Improving Text Generation on Images with Synthetic Captions

Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage

Towards Adaptable and Interactive Image Captioning with Data Augmentation and Episodic Memory

Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment

Learning to Evaluate Image Captioning

Improving face generation quality and prompt following with synthetic captions