M3ixup: A Multi-Modal Data Augmentation Approach for Image Captioning
Yinan Li,Jiayi Ji,Xiaoshuai Sun,Yiyi Zhou,Yunpeng Luo,Rongrong Ji
DOI: https://doi.org/10.1016/j.patcog.2024.110941
IF: 8
2024-01-01
Pattern Recognition
Abstract: Despite their great success, most image captioning (IC) models still tend to generate simple, non-discriminative captions. In this paper, we study this problem from the perspective of data augmentation and propose a novel method called Multi-modal Mixup (M3ixup). Compared with the original Mixup strategy designed for image classification, M3ixup introduces three novel designs that mix IC samples at the level of visual features, sentence embeddings, and loss values, respectively. In practice, M3ixup not only enriches the diversity of IC training data but also forces the model to focus more on visual information when captioning, thereby alleviating the negative effect of dataset bias and addressing the issue of simple captioning. To validate M3ixup, we apply it to three baseline models and conduct extensive experiments on MS COCO. The experimental results demonstrate that M3ixup not only improves the discriminability and quality of the generated captions but also yields clear performance gains for the baseline models, for example improving the CIDEr score of the state-of-the-art model from 133.8 to 135.3 in offline testing and from 135.4 to 137.1 in online testing.
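The abstract names three mixing levels (visual features, sentence embeddings, loss values) but gives no formulas. The following is only a minimal sketch of how the original Mixup interpolation could be applied at each level, not the authors' method. The function name `mix_caption_samples`, the `alpha` parameter, the Beta-distributed coefficient, and the assumption that both samples' features and padded sentence embeddings share the same shape are all hypothetical choices made for illustration.

```python
import numpy as np


def mix_caption_samples(feat_a, feat_b, emb_a, emb_b, loss_a, loss_b,
                        alpha=0.5, rng=None):
    """Hypothetical Mixup-style interpolation of two captioning samples.

    feat_*: visual feature arrays of identical shape (assumed).
    emb_*:  sentence embedding arrays of identical shape (assumed, e.g. padded).
    loss_*: scalar loss values computed against each sample's caption.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Mixing coefficient drawn from Beta(alpha, alpha), as in the original Mixup.
    lam = float(rng.beta(alpha, alpha))
    # The same convex combination is applied at all three levels.
    mixed_feat = lam * feat_a + (1.0 - lam) * feat_b
    mixed_emb = lam * emb_a + (1.0 - lam) * emb_b
    mixed_loss = lam * loss_a + (1.0 - lam) * loss_b
    return mixed_feat, mixed_emb, mixed_loss, lam
```

In a training loop, `loss_a` and `loss_b` would presumably be the captioning losses of the model's prediction on the mixed input against each of the two ground-truth captions. That pairing is likewise an assumption here, since the abstract does not specify it.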