CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

Xianghu Yue, Xiaohai Tian, Lu Lu, Malu Zhang, Zhizheng Wu, Haizhou Li
2024-02-21
Abstract: There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, which mimics the listening, seeing and reading process of human beings. Humans tend to represent knowledge using two separate systems: one for representing verbal (textual) information and one for representing non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper, we introduce CoAVT, a novel cognition-inspired Correlated Audio-Visual-Text pre-training model to connect the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder to handle textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings and extracts the audiovisual features that are most informative for the corresponding text. Additionally, to leverage the correspondences of audio and vision with language respectively, we also establish audio-text and visual-text bi-modal alignments upon the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: contrastive loss, matching loss and language modeling loss. Extensive experiments show that CoAVT can learn strong multimodal correlations and generalize to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
Audio and Speech Processing, Multimedia, Sound, Image and Video Processing
What problem does this paper attempt to address?
The paper aims to build a unified audio-visual-text (A-V-T) pre-training model that supports a range of multimodal understanding tasks. Specifically, the authors seek a model inspired by human cognition that processes audio, visual, and text information together, mirroring how humans listen, see, and read. Most existing work handles only two modalities (text and visual, or text and audio); this paper explores how to integrate all three to improve multimodal representation learning.

### Main Problems

1. **Challenges in Multimodal Understanding**: How to effectively model and understand information across the audio, visual, and text modalities.
2. **Alignment between Modalities**: How to establish effective alignment between different modalities, given that audio and visual streams are naturally aligned in time.
3. **Mitigation of Modal Gaps**: How to reduce the gaps between modalities so that the model can learn multimodal representations more effectively.

### Solutions

To address these challenges, the authors propose CoAVT (Correlated Audio-Visual-Text pre-training). Its main components are:

1. **Joint Audio-Visual Encoder**: Processes audio and visual inputs together and captures audio-visual synchronization information.
2. **Text Encoder**: Processes text inputs and captures language information.
3. **Query Encoder**: Bridges the joint audio-visual encoder and the text encoder, using learnable query embeddings to extract the audio-visual features most relevant to the text.
4. **Multimodal Alignment Loss**: Trains the model for multimodal alignment with three jointly optimized objectives: contrastive loss, matching loss, and language modeling loss.

### Experimental Results

The authors evaluate CoAVT on several downstream tasks: text-video retrieval, audio-visual event classification, and audio-visual retrieval. CoAVT sets new state-of-the-art results on text-video retrieval on AudioCaps in both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval on AudioSet and VGGSound.

### Contributions

1. **Proposed a multimodal pre-training model inspired by human cognition** that processes information in the audio, visual, and text modalities.
2. **Introduced a query encoder** that serves as a bridge, reduces the gaps between modalities, and strengthens the alignment between them.
3. **Established audio-text and visual-text bimodal alignments** on top of the tri-modal alignment, further improving multimodal representation learning.
4. **Achieved state-of-the-art performance on multiple downstream tasks**, especially text-video retrieval.

In conclusion, the paper addresses these problems by proposing CoAVT, offering a new approach to multimodal understanding tasks.
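
To make the contrastive objective more concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired audio-visual and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the toy two-dimensional embeddings are all hypothetical, and the paper's actual loss, which operates on query encoder outputs, may differ in detail.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(av_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss; the temperature value is an assumed default."""
    av = [l2_normalize(v) for v in av_feats]
    tx = [l2_normalize(v) for v in text_feats]
    # sim[i][j]: scaled cosine similarity between audio-visual item i and text item j
    sim = [[sum(a * t for a, t in zip(av_i, tx_j)) / temperature for tx_j in tx]
           for av_i in av]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-audio-visual direction
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))

# Toy batch (hypothetical values): item i of each list forms a matched pair.
av_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]
shuffled_text = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(av_batch, aligned_text)
loss_shuffled = symmetric_contrastive_loss(av_batch, shuffled_text)
print(loss_aligned < loss_shuffled)  # matched pairs give the lower loss
```

The loss is low when each audio-visual embedding is most similar to its own text and high when the pairing is shuffled, which is the behavior a contrastive alignment objective is meant to reward.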