VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, Jing Liu

2023-04-17

Abstract: In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation. Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio and language in an end-to-end manner. It contains three separate encoders for single-modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain the VALOR model: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language and audio into a common space, building vision-language, audio-language and audiovisual-language alignment simultaneously. MGC learns to generate text tokens conditioned on vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, which contains 1M audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR learns strong multimodal correlations and generalizes to various downstream tasks (e.g., retrieval, captioning and question answering) with different input modalities (e.g., vision-language, audio-language and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks. Code and data are available at the project page: https://casia-iva-group.github.io/projects/VALOR

Machine Learning, Computation and Language, Computer Vision and Pattern Recognition, Multimedia, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses multimodal understanding and generation, in particular how to model the relationships among three modalities: vision, audio, and language. It proposes VALOR (Vision-Audio-Language Omni-Perception Pretraining Model), which, unlike existing vision-language pretraining models, models all three modalities jointly in an end-to-end manner. The main objectives of the paper are:

1. **Establishing General Connections**: Adding the audio modality to vision-language modeling, so that the model captures vision-language, audio-language, and audiovisual-language relationships in a single tri-modal system.
2. **Designing Pretraining Tasks**: Proposing two pretraining tasks, Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA aligns the modalities in a common embedding space; MGC trains the decoder to generate text conditioned on vision, audio, or both.
3. **Building High-Quality Datasets**: Constructing VALOR-1M, a large-scale, high-quality dataset of 1 million audible videos with human-annotated audiovisual captions, to support tri-modal research.
4. **Evaluating Model Performance**: Validating VALOR on downstream tasks such as retrieval, captioning, and question answering, and reporting new state-of-the-art results on a series of public cross-modality benchmarks.

Overall, the paper extends vision-language pretraining with the audio modality in order to advance multimodal understanding and generation.
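The alignment idea behind MGA can be illustrated with a small contrastive-loss sketch. This is not the authors' implementation: the function names, the temperature value, and the fusion of vision and audio by averaging normalized embeddings are all assumptions made for illustration, and the paper's actual similarity computation may differ. The sketch only shows how one text embedding can be aligned against three condition groups (vision, audio, and audiovisual) at once.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(text_emb, cond_emb, temperature=0.07):
    """InfoNCE-style loss: row i of text_emb is the positive for row i of cond_emb."""
    t = l2_normalize(text_emb)
    c = l2_normalize(cond_emb)
    logits = t @ c.T / temperature               # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    text_to_cond = -log_softmax(logits, axis=1)[idx, idx].mean()
    cond_to_text = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (text_to_cond + cond_to_text)

def grouping_alignment_loss(text_emb, vision_emb, audio_emb):
    """Align text with three condition groups: vision, audio, and audiovisual.

    ASSUMPTION: the audiovisual embedding is the mean of the normalized vision
    and audio embeddings. This is a placeholder for illustration only.
    """
    fused = 0.5 * (l2_normalize(vision_emb) + l2_normalize(audio_emb))
    groups = {"T-V": vision_emb, "T-A": audio_emb, "T-AV": fused}
    losses = {name: symmetric_contrastive_loss(text_emb, cond)
              for name, cond in groups.items()}
    losses["total"] = sum(losses.values())       # summed over the three groups
    return losses

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, dim = 4, 8
    # Random stand-ins for encoder outputs already projected to a common space.
    text = rng.normal(size=(batch, dim))
    vision = text + 0.1 * rng.normal(size=(batch, dim))
    audio = text + 0.1 * rng.normal(size=(batch, dim))
    for name, value in grouping_alignment_loss(text, vision, audio).items():
        print(f"{name}: {value:.4f}")
```

Summing the three group losses means a single text encoder is pushed to match vision-only, audio-only, and audiovisual inputs at the same time, which is one way to support downstream tasks with different input modalities.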

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Understanding Chinese Video and Language Via Contrastive Multimodal Pre-Training

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

Improving Multi-modal Large Language Model through Boosting Vision Capabilities

LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Leveraging per Image-Token Consistency for Vision-Language Pre-training

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning