Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples

Yeyuan Wang, Dehong Gao, Lei Yi, Linbo Jin, Jinxia Zhang, Libin Yang, Xiaoyan Cai
2024-12-13
Abstract: Existing Vision-Language Pretraining (VLP) methods have achieved remarkable improvements across a variety of vision-language tasks, confirming their effectiveness in capturing coarse-grained semantic correlations. However, their capability for fine-grained understanding, which is critical for many nuanced vision-language applications, remains limited. Prevailing VLP models often overlook the intricate distinctions in expressing different modal features and typically depend on the similarity of holistic features for cross-modal interactions. Moreover, these models directly align and integrate features from different modalities, focusing more on coarse-grained general representations, thus failing to capture the nuanced differences necessary for tasks demanding a more detailed perception. In response to these limitations, we introduce Negative Augmented Samples (NAS), a refined vision-language pretraining model that innovatively incorporates NAS to specifically address the challenge of fine-grained understanding. NAS utilizes a Visual Dictionary (VD) as a semantic bridge between visual and linguistic domains. Additionally, it employs a Negative Visual Augmentation (NVA) method based on the VD to generate challenging negative image samples. These samples deviate from positive samples exclusively at the token level, thereby necessitating that the model discerns the subtle disparities between positive and negative samples with greater precision. Comprehensive experiments validate the efficacy of NAS components and underscore its potential to enhance fine-grained vision-language comprehension.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is that existing Vision-Language Pretraining (VLP) models have limited capability for fine-grained understanding. Specifically, current VLP models mainly capture the overall relationships between the visual and language modalities while ignoring subtler local interactions. This limitation makes them perform poorly on tasks that require detailed perception, for example in medicine, agriculture, and e-commerce.

To address this issue, the authors introduce Negative Augmented Samples (NAS), a VLP model that uses negative sample augmentation to enhance fine-grained understanding. NAS uses the Visual Dictionary (VD) as a semantic bridge between the visual and language domains and adopts the VD-based Negative Visual Augmentation (NVA) method to generate challenging negative image samples. These samples differ from positive samples only at the token level, forcing the model to distinguish the subtle differences between positive and negative samples more precisely.

### Main Contributions

1. **Proposing the NTVA method**: Simultaneously constructing difficult negative text and negative visual samples, significantly enhancing fine-grained understanding.
2. **Introducing the NAS model**: Applying the NTVA method to a VLP model, significantly improving its fine-grained understanding.
3. **Experimental verification**: Experiments on datasets such as ARO, Winoground, and VALSE demonstrate the effectiveness of NAS, and new state-of-the-art (SOTA) results are achieved on these datasets.

### Specific Implementation of the Solution

- **Model Architecture**: NAS is pre-trained in two stages. In the first stage, quantized image embeddings and encoded text embeddings are integrated through a cross-attention mechanism. In the second stage, the NVA module generates token-level negative image samples, which are fed together with the positive samples into the multi-modal encoder for training.
- **Negative Visual Augmentation Module (NVA)**: Uses the VD to quantize continuous visual features, then replaces object embeddings based on global and local feature similarities to generate negative image samples.
- **Pre-training Tasks**:
  - Fine-grained Image-Text Matching (FGITM): Predict whether a given image-text pair matches.
  - Image-Text Contrastive Learning (ITC): Calculate the similarity between images and texts.
  - Masked Language Modeling (MLM): Use the image and the surrounding text to predict masked words.

Through these improvements, the NAS model can achieve higher accuracy and robustness on fine-grained tasks, addressing the deficiencies of existing VLP models in fine-grained understanding.
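
The NVA idea described above (quantize visual features with the VD, then swap a few tokens for similar but different entries) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `quantize` and `make_negative`, the random codebook, and the choice of positions to replace are all hypothetical, and the replacement rule here uses only cosine similarity between codebook entries, which simplifies the "global and local feature similarities" mentioned in the summary.

```python
import numpy as np

def quantize(features, codebook):
    """Map each token feature to the index of its nearest codebook entry (L2 distance)."""
    # features: (n_tokens, dim), codebook: (n_codes, dim)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def make_negative(indices, codebook, replace_positions):
    """Replace the selected tokens with the most similar *different* codebook entry.

    Assumption: cosine similarity between codebook entries stands in for the
    paper's similarity criteria, which the summary does not spell out.
    """
    normed = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # an entry may never be replaced by itself
    neg = indices.copy()
    neg[replace_positions] = sim[indices[replace_positions]].argmax(axis=1)
    return neg

# Toy data: a 16-entry visual dictionary and 6 image tokens of dimension 8.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))
features = rng.normal(size=(6, 8))

pos_idx = quantize(features, codebook)                      # positive sample (token ids)
neg_idx = make_negative(pos_idx, codebook, np.array([1, 4]))  # hard negative
pos_emb, neg_emb = codebook[pos_idx], codebook[neg_idx]     # quantized embeddings
```

In this sketch the negative sample differs from the positive one only at the two replaced token positions, which mirrors the token-level difference the summary attributes to NVA. In a full pipeline, both `pos_emb` and `neg_emb` would presumably be passed to the multi-modal encoder for the matching objective.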