RobustMixGen: Data augmentation for enhancing robustness of visual-language models in the presence of distribution shift

Sunwoo Kim, Hun Im, Woojun Lee, Seonggye Lee, Pilsung Kang
DOI: https://doi.org/10.1016/j.neucom.2024.129167
IF: 6
2024-12-15
Neurocomputing
Abstract: With the increasing deployment of Vision-Language Models (VLMs) in real-world applications, there is growing interest in enhancing their robustness to noise. Data augmentation has emerged as a prominent approach for improving robustness, and in the context of VLMs, MixGen has been widely adopted. Despite its success in improving performance, our experiments indicate that MixGen significantly degrades performance under distribution shift conditions, primarily due to the model's reliance on spurious correlations induced by MixGen-augmented data. To address this limitation, we propose a novel augmentation method that enhances both model performance and robustness by mitigating the learning of spurious correlations. Our approach involves the pre-classification of object and background categories. For image synthesis, we introduce the CutMixup technique, while for text synthesis, we employ a conjunction concatenation strategy, both aimed at reducing the impact of spurious correlations. We evaluated the efficacy of our method on the COCO dataset, a large-scale benchmark comprising images and text, using a retrieval task under simulated distribution shift conditions. Our experimental results demonstrate the superiority of the proposed method, with a 17.11% improvement in the robustness metric (MMI) under distribution shift scenarios, establishing it as a more effective data augmentation technique. In future work, we aim to extend the augmentation method to vision-language tasks beyond retrieval.
computer science, artificial intelligence
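
The abstract names the augmentation steps without specifying them, so the following is only a rough sketch of how they might look, not the authors' implementation. `mixgen` follows the commonly described MixGen baseline (pixel-wise image interpolation plus caption concatenation). `cutmixup_sketch` and `conjunction_concat` are hypothetical helpers: the box-restricted blending, the `lam` weights, and the use of "and" as the conjunction are all assumptions made for illustration. Images are tiny grayscale grids stored as nested lists so the example has no dependencies.

```python
# Illustrative sketch only: the abstract does not define these operations in detail.
# Images are small grayscale grids (lists of rows of floats).


def mixgen(img_a, img_b, text_a, text_b, lam=0.5):
    """MixGen-style baseline: blend whole images pixel-wise and concatenate captions."""
    image = [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]
    return image, f"{text_a} {text_b}"


def cutmixup_sketch(obj_img, bg_img, box, lam=0.7):
    """Hypothetical CutMixup-like step (assumption, not the paper's definition):
    blend the object image into the background only inside
    box = (top, left, bottom, right) and keep the background elsewhere."""
    top, left, bottom, right = box
    out = [row[:] for row in bg_img]  # copy so the inputs are not mutated
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = lam * obj_img[r][c] + (1.0 - lam) * bg_img[r][c]
    return out


def conjunction_concat(obj_text, bg_text, conjunction="and"):
    """Hypothetical text rule (assumption): join two captions with a conjunction."""
    return f"{obj_text.rstrip('. ')} {conjunction} {bg_text.strip()}"


# Toy inputs: an all-ones "object" image and an all-zeros "background" image.
obj_img = [[1.0] * 4 for _ in range(4)]
bg_img = [[0.0] * 4 for _ in range(4)]

mixed = cutmixup_sketch(obj_img, bg_img, box=(1, 1, 3, 3), lam=0.7)
caption = conjunction_concat("A dog sitting.", "a grassy field")
baseline_img, baseline_text = mixgen(obj_img, bg_img, "a dog", "a field")
```

In this sketch the baseline blends every pixel of both images, whereas the box-restricted variant leaves the background untouched outside the pasted region. This is one plausible reading of how pairing a pre-classified object with a background could limit spurious object-background associations.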