Abstract: Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across modalities. To this end, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of these tasks is to reconstruct masked tokens from the visible context, which learns local-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, which limits how well global representations align with the local features of the other modality. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task completes the missing semantics of masked data and recovers global and local features through cross-modal interactions. GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which have a great impact on the performance of downstream tasks, while MLTC reconstructs modal-fusion local tokens, further improving comprehension of multimodal data. To evaluate the proposed approaches on cross-modal alignment, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder that enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
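
The abstract does not state the loss used for masked global semantic completion, so the following is only an illustrative sketch. It assumes an InfoNCE-style contrastive objective that pulls the global feature recovered from masked input toward the global feature of the same sample's complete input, and pushes it away from other samples in the batch. The function names, the temperature value, and the toy feature vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_completion_loss(recovered_feats, complete_feats, temperature=0.1):
    """Assumed InfoNCE-style loss (not confirmed by the abstract).

    recovered_feats[i] is the global feature obtained from the masked input
    of sample i; complete_feats[i] is the global feature of the same sample
    without masking. Each recovered feature should match its own complete
    feature better than those of the other samples in the batch.
    """
    total = 0.0
    for i, recovered in enumerate(recovered_feats):
        logits = [cosine(recovered, target) / temperature
                  for target in complete_feats]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(recovered_feats)


# Toy batch of two samples with 2-dimensional global features (made up).
complete = [[1.0, 0.0], [0.0, 1.0]]
well_recovered = [[0.9, 0.1], [0.1, 0.9]]    # close to the matching targets
poorly_recovered = [[0.1, 0.9], [0.9, 0.1]]  # close to the wrong targets

print(global_completion_loss(well_recovered, complete))
print(global_completion_loss(poorly_recovered, complete))
```

Under this assumption, the loss is small when the recovered global features line up with their own complete counterparts and large when they line up with another sample's features. A local token completion term could be sketched the same way over per-token features instead of one global vector per sample.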

Cross-Modal Semantic Alignment before Fusion for Two-Pass End-to-End Spoken Language Understanding

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment

Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding

Aligner²: Enhancing Joint Multiple Intent Detection and Slot Filling Via Adjustive and Forced Cross-Task Alignment

Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation

Multi-level multilingual semantic alignment for zero-shot cross-lingual transfer learning

Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

MCSFF: Multi-modal Consistency and Specificity Fusion Framework for Entity Alignment

Global and Local Semantic Completion Learning for Vision-Language Pre-training

Learning Semantic Alignment Using Global Features and Multi-scale Confidence

Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment

MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

Monolingual Recognizers Fusion for Code-switching Speech Recognition

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

Boosting Continuous Sign Language Recognition via Cross Modality Augmentation

Cross-modal Alignment with Optimal Transport for CTC-based ASR

Multi-Modal Fusion-Based Multi-Task Semantic Communication System

Cross-modal fine-grained alignment and fusion network for multimodal aspect-based sentiment analysis

Effective Cross-Utterance Language Modeling for Conversational Speech Recognition