Abstract: There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, one that mimics the listening, seeing and reading processes of human beings. Humans tend to represent knowledge using two separate systems: one for representing verbal (textual) information and one for representing non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper we introduce CoAVT, a novel cognition-inspired Correlated Audio-Visual-Text pre-training model that connects the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder that handles textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings and extracts the most informative audiovisual features of the corresponding text. Additionally, to leverage the correspondences of audio and of vision with language, we also establish audio-text and visual-text bi-modal alignments on top of the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: a contrastive loss, a matching loss and a language modeling loss. Extensive experiments show that CoAVT learns strong multimodal correlations and generalizes to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
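The abstract names a contrastive loss as one of the three objectives but does not give its form. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style loss over a batch of paired audiovisual and text feature vectors. The function names, the temperature value of 0.07 and the plain-list feature format are all hypothetical choices made for this example, not details taken from CoAVT.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(av_feats, text_feats, temperature=0.07):
    """Assumed InfoNCE-style loss: the i-th audiovisual feature should match
    the i-th text feature and mismatch every other text in the batch.
    The temperature is a hypothetical default, not a value from the paper."""
    av = [l2_normalize(v) for v in av_feats]
    tx = [l2_normalize(v) for v in text_feats]
    n = len(av)

    # Pairwise cosine similarities scaled by the temperature.
    sim = [
        [sum(a * t for a, t in zip(av[i], tx[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Audiovisual-to-text direction: the correct text sits on the diagonal.
    av_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-audiovisual direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    text_to_av = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (av_to_text + text_to_av)


# Toy batch of two pairs: correctly paired features versus swapped texts.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired features yield a much smaller loss than the swapped ones, which is the behavior a contrastive alignment objective is meant to reward. The matching and language modeling losses mentioned in the abstract are not sketched here.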

Cross-Modal Coherence for Text-to-Image Retrieval

Image–text coherence and its implications for multimodal AI

Exploring coherence from heterogeneous representations for OCR image captioning

Cross-Modal Image-Text Retrieval with Semantic Consistency

Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

Towards Cross-Modal Text-Molecule Retrieval with Better Modality Alignment

Cross-modal alignment with graph reasoning for image-text retrieval

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning

Composing Object Relations and Attributes for Image-Text Matching

Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Context‐aware relation enhancement and similarity reasoning for image‐text retrieval

Modality-dependent Cross-media Retrieval

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Semantic Modeling of Textual Relationships in Cross-modal Retrieval

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image Captioning

MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval