Abstract: Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively.

What problem does this paper attempt to address?

This paper addresses vision-to-audio (V2A) generation, a cross-modal generation problem in which semantically relevant audio is produced automatically from visual input. Existing V2A methods tend to build complex systems and train several modules from scratch on modestly sized datasets. This is resource-intensive, and the limited training data restricts how well each module generalizes.

To address this, the authors propose a lightweight solution built on foundation models (FMs), specifically CLIP, CLAP, and AudioLDM. They introduce a simple mapping mechanism, the V2A-Mapper, that translates visual input from the CLIP space into the CLAP space and so bridges the domain gap between the visual and auditory modalities. Conditioned on the translated CLAP embedding, the pretrained audio generation model AudioLDM then produces high-fidelity sound aligned with the visual input.

### Main contributions:

1. **Explore the potential of FMs in V2A generation**: Investigate how large-scale pretrained foundation models can be used to solve the cross-modal generation problem.
2. **Propose the V2A-Mapper**: Design a simple but effective mapper that connects the visual and auditory foundation models.
3. **Analyze mapping strategies**: Compare generative and regression-based mappers. The generative mapper performs better on fidelity and variability (FD), while the regression mapper is slightly better on relevance (CS).
4. **Performance evaluation**: Objective and subjective evaluations on two V2A datasets show that the method is both effective and efficient. Compared with current state-of-the-art approaches, it is trained with 86% fewer parameters yet improves FD and CS by 53% and 19%, respectively.

### Method overview:

- **Visual encoder FM (CLIP)**: Extracts visual features.
- **Audio encoder FM (CLAP)**: Extracts audio features.
- **Audio generator FM (AudioLDM)**: Generates audio waveforms conditioned on CLAP embeddings.
- **Trainable V2A-Mapper**: Converts CLIP embeddings into CLAP embeddings, bridging the domain gap between the visual and auditory spaces.

With this design, only the V2A-Mapper needs to be trained, which reduces the parameter count and the compute required for training while improving both the quality of the generated audio and its relevance to the visual input.
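The regression variant of the mapper described above can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a purely linear mapper fitted by least squares, uses randomly generated vectors as stand-ins for CLIP and CLAP embeddings, and picks the embedding widths (`D_CLIP`, `D_CLAP`) arbitrarily. It does not call CLIP, CLAP, or AudioLDM, and it does not cover the generative mapper that the paper reports as better on FD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only; the real CLIP and CLAP
# embedding widths are not given in this summary.
N_TRAIN, N_TEST, D_CLIP, D_CLAP = 2000, 200, 64, 48


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Synthetic stand-ins for paired (visual, audio) embeddings. In a real
# pipeline these would come from the frozen CLIP and CLAP encoders.
hidden_map = rng.normal(size=(D_CLIP, D_CLAP))
clip_emb = l2_normalize(rng.normal(size=(N_TRAIN + N_TEST, D_CLIP)))
noise = 0.05 * rng.normal(size=(N_TRAIN + N_TEST, D_CLAP))
clap_emb = clip_emb @ hidden_map + noise

clip_train, clip_test = clip_emb[:N_TRAIN], clip_emb[N_TRAIN:]
clap_train, clap_test = clap_emb[:N_TRAIN], clap_emb[N_TRAIN:]

# Regression mapper (assumed linear here): least-squares fit from the
# "CLIP" space to the "CLAP" space. Only these weights are trained.
mapper_weights, *_ = np.linalg.lstsq(clip_train, clap_train, rcond=None)


def map_clip_to_clap(clip_vectors):
    """Translate visual embeddings into the audio embedding space."""
    return clip_vectors @ mapper_weights


# Relevance proxy: cosine similarity between the mapped visual embedding
# and the paired audio embedding on held-out data.
mapped = map_clip_to_clap(clip_test)
cosine = np.sum(l2_normalize(mapped) * l2_normalize(clap_test), axis=-1)
mean_cosine = float(cosine.mean())
print(f"mean cosine similarity on held-out pairs: {mean_cosine:.3f}")
```

In the full system, the mapped embedding would be passed to the frozen audio generator as its conditioning signal. The encoders and the generator stay fixed, which is why the trainable parameter count remains small.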

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
