Abstract: There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, which mimics the listening, seeing and reading process of human beings. Humans tend to represent knowledge using two separate systems: one for representing verbal (textual) information and one for representing non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper, we introduce CoAVT -- a novel cognition-inspired Correlated Audio-Visual-Text pre-training model to connect the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder to handle textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings and extracts the audiovisual features that are most informative for the corresponding text. Additionally, to leverage the correspondences between audio and vision with language respectively, we also establish the audio-text and visual-text bi-modal alignments upon the foundational audiovisual-text tri-modal alignment to enhance the multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: contrastive loss, matching loss and language modeling loss. Extensive experiments show that CoAVT can learn strong multimodal correlations and generalize to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on the audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
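The query encoder described in the abstract can be pictured as a small set of learnable vectors that cross-attend to the audio-visual encoder output. The following is only a minimal sketch under stated assumptions: it uses single-head attention with no learned projections, random vectors stand in for trained parameters and encoder outputs, and the names `cross_attend`, `queries`, and `av_features` are hypothetical rather than taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention without projections (illustrative only).

    queries:  list of Q vectors (hypothetical learnable query embeddings)
    features: list of N vectors (hypothetical audio-visual encoder outputs)
    Returns Q vectors, each a softmax-weighted average of the feature vectors.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every feature token.
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        weights = softmax(scores)
        # Pool the feature tokens with the attention weights.
        pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        outputs.append(pooled)
    return outputs


random.seed(0)
DIM, NUM_QUERIES, NUM_TOKENS = 8, 4, 20  # assumed sizes, not from the paper
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
av_features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
query_outputs = cross_attend(queries, av_features)
print(len(query_outputs), len(query_outputs[0]))
```

Whatever the number of audio-visual tokens, the output has one vector per query, so the text side always sees a fixed-size summary. A real model would add learned projections, multiple heads, and several layers.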

What problem does this paper attempt to address?

This paper attempts to build a unified audio-visual-text (A-V-T) pre-training model that supports a range of multimodal understanding tasks. Specifically, the authors aim to build a model inspired by human cognition that processes audio, visual, and text information simultaneously, thereby better simulating how humans listen, see, and read. Most existing work handles only two modalities (such as text and visual, or text and audio), while this paper explores how to integrate all three effectively to improve multimodal representation learning.

### Main Problems

1. **Challenges in Multimodal Understanding**: How to effectively model and understand information across the audio, visual, and text modalities.
2. **Alignment between Modalities**: How to establish effective alignment between different modalities, especially given the natural temporal alignment between the audio and visual modalities.
3. **Mitigation of Modal Gaps**: How to reduce the gaps between different modalities so that the model can learn multimodal representations more effectively.

### Solutions

To address the above challenges, the authors propose a model named CoAVT (Correlated Audio-Visual-Text pre-training). The main features of this model include:

1. **Joint Audio-Visual Encoder**: Processes audio and visual inputs simultaneously and captures audio-visual synchronization information.
2. **Text Encoder**: Processes text inputs and captures language information.
3. **Query Encoder**: Serves as a bridge between the joint audio-visual encoder and the text encoder, extracting the most relevant audio-visual features.
4. **Multimodal Alignment Loss**: Optimizes multimodal alignment through contrastive loss, matching loss, and language modeling loss.

### Experimental Results

Through experiments on multiple downstream tasks, including text-video retrieval, audio-visual event classification, and audio-visual retrieval, the authors verified the effectiveness of the CoAVT model. The results show that CoAVT achieved new best performance on multiple benchmark datasets, especially on the text-video retrieval task in zero-shot and fine-tuning settings.

### Contributions

1. **Proposed a multimodal pre-training model inspired by human cognition**, which can effectively process information in the audio, visual, and text modalities.
2. **Introduced a query encoder**, which serves as a bridge, reduces the gaps between different modalities, and enhances the alignment between modalities.
3. **Established bimodal alignments of audio-text and visual-text**, further improving multimodal representation learning.
4. **Achieved state-of-the-art performance on multiple downstream tasks**, especially on the text-video retrieval task.

In conclusion, this paper addresses key problems in multimodal processing by proposing the CoAVT model and provides new solutions for multimodal understanding tasks.
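
Of the three objectives named above, the contrastive loss is the easiest to illustrate in isolation. The sketch below is a generic symmetric InfoNCE-style loss over a batch of paired audio-visual and text embeddings. It is not confirmed to be the paper's exact formulation: the temperature value of 0.07 and the function names are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax_at(row, index):
    """Log-softmax of row[index], computed stably via log-sum-exp."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_z


def contrastive_loss(av_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of each list is the positive match.

    The temperature is an assumed hyperparameter, not taken from the paper.
    """
    av = [normalize(v) for v in av_embs]
    tx = [normalize(v) for v in text_embs]
    n = len(av)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(av[i], tx[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Audio-visual -> text direction: each row should peak on its diagonal.
    av_to_text = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text -> audio-visual direction: same check on the transposed matrix.
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    text_to_av = -sum(log_softmax_at(sim_t[j], j) for j in range(n)) / n
    return 0.5 * (av_to_text + text_to_av)


# Correctly paired embeddings give a small loss; swapped pairs give a large one.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < swapped)
```

The matching and language modeling objectives would be added to this term during joint optimization. How the three are weighted is not stated in the text above, so it is left out here.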

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

TAVT: Towards Transferable Audio-Visual Text Generation.

Toward a Perceptive Pretraining Framework for Audio-Visual Video Parsing

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

Unified Video-Language Pre-training with Synchronized Audio

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

A Study on Joint Modeling and Data Augmentation of Multi-Modalities for Audio-Visual Scene Classification

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization.

Exploring the Role of Audio in Video Captioning

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

Multimodal Autoregressive Pre-training of Large Vision Encoders

Cross-modal Pretraining and Matching for Video Understanding

Multimodal Variational Auto-encoder based Audio-Visual Segmentation