Abstract: Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across modalities. To this end, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of these tasks is to reconstruct masked tokens from the visible context, which learns local-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, which limits how well global representations align with the local features of the other modality. In this paper, we therefore propose a novel Global and Local Semantic Completion Learning (GLSCL) task that facilitates global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task completes the missing semantics of masked data and recovers global and local features through cross-modal interactions. GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which strongly affect downstream task performance, while MLTC reconstructs modal-fusion local tokens, further improving the model's comprehension of multimodal data. To evaluate the proposed approaches on cross-modal alignment, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder that enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that the proposed method achieves state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
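The abstract does not state the exact objective used for MGSC, so the following is only an illustrative sketch under one assumption: that the global feature recovered from masked input is pulled toward the global feature of the corresponding complete input with an InfoNCE-style contrastive loss, using the other samples in the batch as negatives. The function names `l2_normalize` and `info_nce`, the `temperature` value, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(recovered, targets, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    recovered[i]: global feature recovered from the masked input (assumed).
    targets[i]:   global feature of the matching complete input (assumed).
    Every other target in the batch serves as a negative for sample i.
    """
    rec = [l2_normalize(v) for v in recovered]
    tgt = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, r in enumerate(rec):
        # Cosine similarity of recovered feature i with every target.
        logits = [sum(a * b for a, b in zip(r, t)) / temperature for t in tgt]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(rec)


# Toy batch: two complete-input global features and two candidate recoveries.
complete = [[1.0, 0.0], [0.0, 1.0]]
good = [[0.9, 0.1], [0.1, 0.9]]  # each recovery is close to its own target
bad = [[0.1, 0.9], [0.9, 0.1]]   # each recovery is close to the wrong target
print(info_nce(good, complete) < info_nce(bad, complete))
```

In this sketch a recovery that lands near its own target yields a lower loss than one that lands near another sample's target, which is the behavior a global semantic completion objective would need. The local counterpart (MLTC) could apply the same idea per token, but that too would be an assumption beyond what the abstract specifies.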

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training

Iterative Task-adaptive Pretraining for Unsupervised Word Alignment

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

Learning Speech Representation From Contrastive Token-Acoustic Pretraining

Multi-level Contrastive Learning for Cross-lingual Alignment

T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling

CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Accommodating Audio Modality in CLIP for Multimodal Processing

CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation

Vision-Language Pre-Training with Triple Contrastive Learning

Multi-grained Correspondence Learning of Audio-language Models for Few-shot Audio Recognition

Supervised Contrastive Learning for Cross-Lingual Transfer Learning

A Multi-level Alignment Training Scheme for Video-and-Language Grounding

CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training

Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

Global and Local Semantic Completion Learning for Vision-Language Pre-training

Natural Language Supervision for General-Purpose Audio Representations

Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Improving Multi-lingual Alignment Through Soft Contrastive Learning