Abstract: There has been a long-standing quest for a unified audio-visual-text model that enables various multimodal understanding tasks by mimicking the listening, seeing, and reading processes of human beings. Humans tend to represent knowledge using two separate systems: one for verbal (textual) information and one for non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, we introduce CoAVT, a novel cognition-inspired Correlated Audio-Visual-Text pre-training model that connects the three modalities. It contains a joint audio-visual encoder, which learns to encode audio-visual synchronization information together with the audio and visual content as non-verbal information, and a text encoder, which handles textual input as verbal information. To bridge the gap between modalities, CoAVT employs a query encoder that contains a set of learnable query embeddings and extracts the audiovisual features most informative for the corresponding text. Additionally, to leverage the correspondences of audio and vision with language respectively, we establish audio-text and visual-text bi-modal alignments on top of the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: contrastive loss, matching loss, and language modeling loss. Extensive experiments show that CoAVT learns strong multimodal correlations and generalizes to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
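
As a rough illustration of the first of the three objectives named in the abstract, the following is a minimal sketch of a symmetric contrastive loss between a batch of audiovisual features and the matching text features. The abstract does not specify the exact formulation, so everything here is an assumption for illustration: the function name `symmetric_contrastive_loss`, the cosine-similarity scoring, the temperature value of 0.07, and the use of in-batch negatives are not taken from the paper. Feature vectors are assumed to be non-zero.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        # Numerically stable log-sum-exp over the row.
        lse = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += lse - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(av_feats, text_feats, temperature=0.07):
    """Hypothetical sketch: average of the audiovisual-to-text and
    text-to-audiovisual contrastive losses with in-batch negatives.
    av_feats[i] and text_feats[i] are assumed to be a matching pair."""
    av = [_normalize(v) for v in av_feats]
    tx = [_normalize(v) for v in text_feats]
    # Cosine similarity of every audiovisual/text pair, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(u, w)) / temperature for w in tx]
           for u in av]
    sim_transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (_diagonal_cross_entropy(sim)
                  + _diagonal_cross_entropy(sim_transposed))
```

With this formulation, a batch whose paired features point in the same direction yields a loss close to zero, while a batch whose pairs are swapped yields a large loss. The matching and language modeling objectives would need additional model components and are not sketched here.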

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

Distilling Vision-Language Models on Millions of Videos

Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

CapsFusion: Rethinking Image-Text Data at Scale

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Learning Video-Text Aligned Representations for Video Captioning

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model

Contrastive Video-Language Learning with Fine-grained Frame Sampling

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment