Abstract: Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across modalities. To this end, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of these tasks is to reconstruct masked tokens from the visible context, which teaches local-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, which limits how well global representations align with local features of the other modality. In this paper, we therefore propose a novel Global and Local Semantic Completion Learning (GLSCL) task that facilitates global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task complements the missing semantics of masked data and recovers global and local features through cross-modal interactions. GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which have a great impact on the performance of downstream tasks, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data. To evaluate the proposed approaches on cross-modal alignment, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder that enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that the proposed method achieves state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.

AGREE: Aligning Cross-Modal Entities for Image-Text Retrieval Upon Vision-Language Pre-trained Models

Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models

Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

Unified Lexical Representation for Interpretable Visual-Language Alignment

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

Global and Local Semantic Completion Learning for Vision-Language Pre-training

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

Vision-language pre-training via modal interaction

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity and Relation Extraction