VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, Jing Liu
2023-04-17
Abstract: In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation. Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio and language in an end-to-end manner. It contains three separate encoders for single-modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain the VALOR model: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language and audio into a common space, building vision-language, audio-language and audiovisual-language alignment simultaneously. MGC learns to generate text tokens conditioned on vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, which contains 1M audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR learns strong multimodal correlations and generalizes to various downstream tasks (e.g., retrieval, captioning and question answering) with different input modalities (e.g., vision-language, audio-language and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks. Code and data are available at the project page: https://casia-iva-group.github.io/projects/VALOR
Machine Learning, Computation and Language, Computer Vision and Pattern Recognition, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses multimodal understanding and generation, specifically how to model the relationships among three modalities: vision, audio, and language. It proposes VALOR (Vision-Audio-Language Omni-Perception Pretraining Model), which, unlike existing vision-language pretraining models, models all three modalities jointly in an end-to-end manner. The main objectives of the paper are:

1. **Establishing General Connections**: Adding the audio modality to vision-language modeling, so that the model aligns vision-language, audio-language, and audiovisual-language pairs within one tri-modal system.
2. **Designing Pretraining Tasks**: Proposing two pretraining tasks, Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects the three modalities into a common space; MGC trains the decoder to generate text conditioned on vision, audio, or both.
3. **Building High-Quality Datasets**: Constructing VALOR-1M, a large-scale tri-modality dataset of 1 million audible videos with human-annotated audiovisual captions, to support tri-modal research.
4. **Evaluating Model Performance**: Validating VALOR on downstream tasks such as retrieval, captioning, and question answering, and reporting new state-of-the-art results on a series of public cross-modality benchmarks.

Overall, the paper extends vision-language pretraining with the audio modality in order to advance multimodal understanding and generation.
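
To make the alignment idea more concrete, the following is a minimal sketch of how text could be aligned with several modality groups at once. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so a symmetric InfoNCE-style contrastive loss is assumed here, and the audiovisual embedding is formed by a simple sum of normalized vision and audio embeddings, which is also a placeholder choice. All function names and the `temperature` value are hypothetical. Plain Python is used so the example runs without dependencies.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text, other, temperature=0.07):
    """Assumed InfoNCE-style loss: matched pairs share the same batch index."""
    text = [l2_normalize(t) for t in text]
    other = [l2_normalize(o) for o in other]
    logits = [
        [sum(a * b for a, b in zip(t, o)) / temperature for o in other]
        for t in text
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the text-to-modality and modality-to-text directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def grouped_alignment_loss(text, vision, audio, temperature=0.07):
    """Align text with three groups: vision, audio, and audiovisual."""
    # Hypothetical fusion: sum of unit-length vision and audio embeddings.
    fused = [
        [v + a for v, a in zip(l2_normalize(vi), l2_normalize(au))]
        for vi, au in zip(vision, audio)
    ]
    groups = {"vision": vision, "audio": audio, "audiovisual": fused}
    losses = {
        name: symmetric_contrastive_loss(text, emb, temperature)
        for name, emb in groups.items()
    }
    losses["total"] = sum(losses.values()) / len(groups)
    return losses
```

Calling `grouped_alignment_loss(text, vision, audio)` with three equally sized batches of embeddings returns one loss per group plus their mean. Averaging over the groups is what lets a single text encoder stay compatible with vision-only, audio-only, and audiovisual inputs at inference time, which matches the range of downstream input modalities described in the abstract.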