V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, Weidong Cai
2023-12-14
Abstract: Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively.
Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses vision-to-audio (V2A) generation, a cross-modal generation problem: automatically generating semantically relevant audio from visual input. Existing V2A methods typically design complex systems and train multiple modules from scratch on modestly sized datasets. This is resource-intensive, and the limited amount of data restricts how well each module generalizes. The authors instead propose a lightweight solution built on foundation models (FMs), specifically CLIP, CLAP, and AudioLDM. They introduce a simple mapping mechanism (V2A-Mapper) that translates the visual input from the CLIP space to the CLAP space, bridging the domain gap between the visual and auditory modalities. Conditioned on the translated CLAP embedding, the pretrained audio generation FM AudioLDM then produces high-fidelity sound aligned with the visual input.

### Main contributions:

1. **Explore the potential of FMs in V2A generation**: Investigate how large-scale pretrained foundation models can be used to solve this cross-modal generation problem.
2. **Propose V2A-Mapper**: Design a simple but effective mapper that connects the visual and auditory foundation models.
3. **Analyze mapping strategies**: Compare generative and regression-based mappers, finding that the generative mapper is better at fidelity and variability (FD), while the regression mapper is slightly better at relevance (CS).
4. **Performance evaluation**: Demonstrate the effectiveness and efficiency of the method through objective and subjective evaluation on two V2A datasets. Compared with current state-of-the-art approaches, the method is trained with 86% fewer parameters yet achieves 53% and 19% improvement in FD and CS, respectively.

### Method overview:

- **Visual encoder FM (CLIP)**: Extracts visual features.
- **Audio encoder FM (CLAP)**: Extracts audio features.
- **Audio generator FM (AudioLDM)**: Generates audio waveforms conditioned on CLAP embeddings.
- **Trainable V2A-Mapper**: Converts CLIP embeddings into CLAP embeddings, bridging the domain gap between the visual and auditory spaces.

With this design, the authors reduce the number of trained parameters and the compute needed for training, while improving both the quality of the generated audio and its relevance to the visual input.
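
To make the role of the trainable mapper concrete, here is a minimal sketch of a regression-style mapper trained on paired embeddings. It is an illustration under stated assumptions, not the authors' architecture: the embeddings are synthetic stand-ins for CLIP and CLAP outputs, the dimensions `CLIP_DIM` and `CLAP_DIM` are toy values, and `LinearMapper`, `make_toy_pairs`, `mse`, and `train` are hypothetical names introduced here. The frozen encoders and AudioLDM are not called at all; only the part that would be trained is shown.

```python
import random

# Hypothetical toy sizes (assumption); real CLIP/CLAP embeddings are much larger.
CLIP_DIM = 4
CLAP_DIM = 3


def make_toy_pairs(n, seed=0):
    """Synthetic stand-ins for paired (CLIP, CLAP) embeddings of the same clip."""
    rng = random.Random(seed)
    # A hidden linear relation plays the role of the unknown cross-modal mapping.
    true_w = [[rng.uniform(-1, 1) for _ in range(CLIP_DIM)] for _ in range(CLAP_DIM)]
    pairs = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(CLIP_DIM)]
        y = [sum(w * xi for w, xi in zip(row, x)) for row in true_w]
        pairs.append((x, y))
    return pairs


class LinearMapper:
    """Regression-style mapper: y_hat = W x + b, trained with mean squared error."""

    def __init__(self, in_dim, out_dim):
        self.w = [[0.0] * in_dim for _ in range(out_dim)]
        self.b = [0.0] * out_dim

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.w, self.b)]

    def sgd_step(self, x, y, lr):
        pred = self(x)  # predictions are computed before any weight is changed
        for j, (p, t) in enumerate(zip(pred, y)):
            grad = 2.0 * (p - t) / len(y)  # derivative of the MSE w.r.t. pred[j]
            for i, xi in enumerate(x):
                self.w[j][i] -= lr * grad * xi
            self.b[j] -= lr * grad


def mse(mapper, pairs):
    """Mean squared error between mapped source and target embeddings."""
    total = 0.0
    for x, y in pairs:
        pred = mapper(x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
    return total / len(pairs)


def train(mapper, pairs, epochs=200, lr=0.1):
    """Plain per-sample SGD; only the mapper's weights are updated."""
    for _ in range(epochs):
        for x, y in pairs:
            mapper.sgd_step(x, y, lr)
    return mapper


pairs = make_toy_pairs(64)
mapper = LinearMapper(CLIP_DIM, CLAP_DIM)
loss_before = mse(mapper, pairs)
train(mapper, pairs)
loss_after = mse(mapper, pairs)
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.2e}")
```

In the full pipeline the mapped vector would be passed to the frozen audio generator as its conditioning embedding. A generative mapper, which the paper reports as better on FD, would replace the deterministic `LinearMapper` with a model that samples a CLAP-space embedding given the CLIP embedding; that variant is not sketched here.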