Abstract: There has been a long-standing quest for a unified audio-visual-text model that enables various multimodal understanding tasks by mimicking how humans listen, see and read. Humans tend to represent knowledge using two separate systems: one for verbal (textual) information and one for non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper we introduce CoAVT, a novel cognition-inspired Correlated Audio-Visual-Text pre-training model that connects the three modalities. It contains a joint audio-visual encoder, which learns to encode audio-visual synchronization information together with the audio and visual content as non-verbal information, and a text encoder, which handles textual input as verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings and extracts the audiovisual features most informative of the corresponding text. Additionally, to leverage the correspondences of audio and of vision with language, we establish audio-text and visual-text bi-modal alignments on top of the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: a contrastive loss, a matching loss and a language modeling loss. Extensive experiments show that CoAVT learns strong multimodal correlations and generalizes to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
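
To make the contrastive objective mentioned in the abstract more concrete, the following is a minimal sketch of a symmetric contrastive loss between a batch of audiovisual embeddings and their paired text embeddings. The abstract does not give CoAVT's exact formulation, so the function names, the temperature value and the use of cosine similarity are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(av_emb, text_emb, temperature=0.07):
    # Hypothetical helper: row i of av_emb is assumed to be paired with
    # row i of text_emb; all other rows in the batch act as negatives.
    av = l2_normalize(av_emb)
    tx = l2_normalize(text_emb)
    logits = av @ tx.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Audiovisual-to-text direction: softmax over text candidates (rows).
    loss_av_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audiovisual direction: softmax over audiovisual candidates (columns).
    loss_text_to_av = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_av_to_text + loss_text_to_av)

rng = np.random.default_rng(0)
av_batch = rng.normal(size=(4, 8))
loss_matched = symmetric_contrastive_loss(av_batch, av_batch)
loss_mismatched = symmetric_contrastive_loss(av_batch, np.roll(av_batch, 1, axis=0))
```

In this toy usage, correctly paired embeddings yield a lower loss than deliberately misaligned ones. Under the same assumptions, the matching and language modeling losses named in the abstract would be added to this term to form the joint objective.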

NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

OAW-GAN: Occlusion-Aware Warping GAN for Unified Human Video Synthesis

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Unified Vision-Language Pre-Training for Image Captioning and VQA

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

All in One: Exploring Unified Video-Language Pre-training

Emu: Generative Pretraining in Multimodality

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

Unified Video-Language Pre-training with Synchronized Audio

Imagination-Augmented Natural Language Understanding

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Multimodal Pre-training Method for Vision-language Understanding and Generation