Abstract: Existing Vision-Language Pretraining (VLP) methods have achieved remarkable improvements across a variety of vision-language tasks, confirming their effectiveness in capturing coarse-grained semantic correlations. However, their capability for fine-grained understanding, which is critical for many nuanced vision-language applications, remains limited. Prevailing VLP models often overlook the intricate distinctions in expressing different modal features and typically depend on the similarity of holistic features for cross-modal interactions. Moreover, these models directly align and integrate features from different modalities, focusing more on coarse-grained general representations, thus failing to capture the nuanced differences necessary for tasks demanding a more detailed perception. In response to these limitations, we introduce Negative Augmented Samples (NAS), a refined vision-language pretraining model that innovatively incorporates NAS to specifically address the challenge of fine-grained understanding. NAS utilizes a Visual Dictionary (VD) as a semantic bridge between visual and linguistic domains. Additionally, it employs a Negative Visual Augmentation (NVA) method based on the VD to generate challenging negative image samples. These samples deviate from positive samples exclusively at the token level, thereby requiring the model to discern the subtle disparities between positive and negative samples with greater precision. Comprehensive experiments validate the efficacy of NAS components and underscore its potential to enhance fine-grained vision-language comprehension.

What problem does this paper attempt to address?

The problem this paper addresses is that existing Vision-Language Pretraining (VLP) models have limited capability for fine-grained understanding. Specifically, current VLP models mainly capture the overall relationships between the visual and language modalities while ignoring more subtle local interactions. As a result, they perform poorly on tasks that require detailed perception, such as applications in medicine, agriculture, and e-commerce.

To address this issue, the authors introduce a new method, Negative Augmented Samples (NAS), which uses negative sample augmentation to improve fine-grained understanding. NAS uses a Visual Dictionary (VD) as a semantic bridge between the visual and language domains and adopts a VD-based Negative Visual Augmentation (NVA) method to generate challenging negative image samples. These samples differ from positive samples only at the token level, forcing the model to distinguish the subtle differences between positive and negative samples more precisely.

### Main Contributions

1. **Proposing the NTVA method**: Constructs hard negative text samples and hard negative visual samples at the same time, significantly enhancing fine-grained understanding.
2. **Introducing the NAS model**: Applies the NTVA method to a VLP model, significantly improving its fine-grained understanding.
3. **Experimental verification**: Experiments on datasets such as ARO, Winoground, and VALSE demonstrate the effectiveness of NAS and achieve new state-of-the-art (SOTA) results on these datasets.

### Specific Implementation of the Solution

- **Model Architecture**: The NAS model is pre-trained in two stages. In the first stage, quantized image embeddings and encoded text embeddings are integrated through a cross-attention mechanism. In the second stage, the NVA module generates token-level negative image samples, which are fed into the multi-modal encoder together with the positive samples for training.
- **Negative Visual Augmentation (NVA) Module**: Uses the VD to quantize continuous visual features and replaces object embeddings based on global and local feature similarities to generate negative image samples.
- **Pre-training Tasks**:
  - Fine-grained Image-Text Matching (FGITM): Predict whether a given image-text pair matches.
  - Image-Text Contrastive Learning (ITC): Compute the similarity between images and texts.
  - Masked Language Modeling (MLM): Use the image and the surrounding text to predict masked words.

With these improvements, the NAS model can achieve higher accuracy and robustness on fine-grained tasks, addressing the weakness of existing VLP models in fine-grained understanding.
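The quantize-then-replace idea behind the NVA module can be illustrated with a small sketch. This is not the authors' implementation: the function names, the nearest-neighbor quantization rule, and the use of cosine similarity between dictionary entries are all assumptions made for illustration, and the paper's combination of global and local feature similarities is simplified here to a single similarity over codebook vectors.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def quantize(patches: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous patch feature to the index of its nearest
    visual-dictionary entry (assumed rule: smallest Euclidean distance)."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(p, codebook[k]))
        for p in patches
    ]


def most_similar_other(index: int, codebook: Sequence[Vector]) -> int:
    """Index of the dictionary entry most similar to `index`, excluding itself.
    Requires a codebook with at least two entries."""
    candidates = [k for k in range(len(codebook)) if k != index]
    return max(candidates, key=lambda k: cosine(codebook[index], codebook[k]))


def negative_visual_augment(
    token_ids: Sequence[int],
    codebook: Sequence[Vector],
    num_replace: int,
    rng: random.Random,
) -> List[int]:
    """Build a hard negative by swapping `num_replace` randomly chosen visual
    tokens for their most similar (but different) dictionary entries."""
    negative = list(token_ids)
    for pos in rng.sample(range(len(token_ids)), num_replace):
        negative[pos] = most_similar_other(token_ids[pos], codebook)
    return negative


if __name__ == "__main__":
    # Toy 2-D dictionary and patch features (hypothetical values).
    codebook = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    patches = [[0.95, 0.0], [0.1, 0.9], [-0.8, 0.1]]
    positive_ids = quantize(patches, codebook)
    negative_ids = negative_visual_augment(positive_ids, codebook, 1, random.Random(0))
    print("positive tokens:", positive_ids)
    print("negative tokens:", negative_ids)
```

Because each replacement is the closest different dictionary entry, the negative sample stays near the positive one and differs from it only at the selected token positions, which is what makes it a hard negative for an image-text matching objective such as FGITM.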

Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples

Neighbor Does Matter: Curriculum Global Positive-Negative Sampling for Vision-Language Pre-training

Noise-robust Vision-language Pre-training with Positive-negative Learning

ViLTA: Enhancing Vision-Language Pre-training Through Textual Augmentation

Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics

Leveraging per Image-Token Consistency for Vision-Language Pre-training

NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training

VLP: A Survey on Vision-language Pre-training

Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

Enhancing Vision-Language Few-Shot Adaptation with Negative Learning

Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis

Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends

LARE: Latent Augmentation using Regional Embedding with Vision-Language Model

Retrieval-based Knowledge Augmented Vision Language Pre-training

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

Visually-Augmented Language Modeling

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

3D Vision and Language Pretraining with Large-Scale Synthetic Data

GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training