Abstract: Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models that combine high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to the spectrogram latent along straight paths and samples by solving an ODE, outperforming autoregressive and score-based models in audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer, together with channel-level cross-modal feature fusion that provides strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with a guided vector field, our model can generate decent audio in a few sampling steps, or even only one. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
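
To make the straight-path idea concrete, here is a minimal pure-Python sketch of the rectified flow matching training target and of Euler ODE sampling. It is not the authors' implementation: the function names (`straight_path_pair`, `mse`, `euler_sample`, `make_oracle_field`) are hypothetical, the vectors are toy lists rather than spectrogram latents, the video conditioning is omitted, and an analytic "oracle" field for a single target stands in for a trained network.

```python
import random


def straight_path_pair(x0, x1, t):
    """Return the point at time t on the straight path from noise x0 to data x1,
    plus the regression target (the constant velocity x1 - x0)."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def mse(predicted, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_field(x1):
    """Toy stand-in for a trained estimator (assumption): the exact velocity
    field of straight paths that all end at the single point x1."""
    def field(x, t):
        return [(b - xi) / (1.0 - t) for xi, b in zip(x, x1)]
    return field


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(8)]
latent = [random.gauss(0.0, 1.0) for _ in range(8)]

one_step = euler_sample(make_oracle_field(latent), noise, 1)
many_steps = euler_sample(make_oracle_field(latent), noise, 25)
```

In this toy setting the path is exactly straight, so a single Euler step lands on the same endpoint as 25 steps. A learned field is only approximately straight, which is the gap that reflow and distillation are described as narrowing.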

What problem does this paper attempt to address?

This paper addresses three key problems in the video-to-audio (V2A) generation task:

1. **Audio Quality**: The generated audio should have good perceptual quality, which is a basic requirement for any audio generation task.
2. **Temporal Alignment**: The generated audio should not only match the video content but also be synchronized in time with the video frames. This strongly affects the user experience, because humans are very sensitive to the temporal consistency of audiovisual information.
3. **Generation Efficiency**: The model should be efficient in generation speed and resource utilization, which is crucial for its practicality in large-scale, high-throughput applications.

Existing V2A methods fall short in each of these three aspects:

- **Audio Quality**: Early GAN-based methods generate poor-quality audio and lack practicality; autoregressive and diffusion models improve generation quality but still leave room for improvement.
- **Temporal Alignment**: Autoregressive models have difficulty explicitly aligning the generated audio with the video; Diff-Foley relies on additional classifier guidance to achieve good synchronization, which increases model complexity and leads to instability when the number of sampling steps is reduced.
- **Generation Efficiency**: Autoregressive models have high inference latency; Diff-Foley requires a large number of sampling steps to achieve high-quality generation because of the curved sampling trajectory of the diffusion model, which increases inference time.

To solve these problems, the authors adopt rectified flow matching and propose a model named Frieren. Frieren improves on existing methods in the following ways:

- **Rectified Flow Matching**: Regress the conditional transport vector field from noise to the spectrogram latent space and sample by solving an ODE, thereby achieving higher audio quality and diversity.
- **Non-autoregressive Vector Field Estimator**: Adopt a non-autoregressive vector field estimator built on a feed-forward Transformer that does not downsample along the time dimension, so the temporal resolution is preserved.
- **Channel-level Cross-modal Feature Fusion**: Condition the estimator through channel-level cross-modal feature fusion, which exploits the inherent alignment of audio-video data and achieves strong temporal alignment.
- **Reflow and One-step Distillation**: By integrating reflow and one-step distillation, Frieren can generate high-quality audio in a few sampling steps, or even just one, significantly improving generation efficiency.

The experimental results show that Frieren achieves state-of-the-art performance on the VGGSound dataset and outperforms existing methods in generation quality and temporal alignment accuracy. Specifically, Frieren improves the inception score by 6.2% over the strong baseline model and achieves 97.22% temporal alignment accuracy with 25 sampling steps. Moreover, after combining reflow and distillation, Frieren reaches a temporal alignment accuracy of up to 97.85% in just one step, with a 9.3-fold speedup.

In summary, this paper aims to significantly improve audio quality, temporal alignment, and generation efficiency in video-to-audio generation by introducing rectified flow matching together with a series of supporting techniques.
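
The channel-level fusion idea mentioned above can be illustrated with a small pure-Python sketch. The details here are assumptions rather than the paper's confirmed design: the helper names (`align_to_length`, `channel_fuse`) are hypothetical, nearest-neighbour resampling is one plausible way to match the visual frame rate to the latent frame rate, and concatenation along the channel axis is one plausible fusion operation.

```python
def align_to_length(frames, target_len):
    """Nearest-neighbour temporal resampling: map each of the target_len
    output positions to the source frame covering the same time span."""
    src_len = len(frames)
    return [frames[i * src_len // target_len] for i in range(target_len)]


def channel_fuse(audio_latent, visual_feats):
    """Fuse per-frame visual features into the audio latent sequence.

    audio_latent: list of T_a frames, each a list of C_a channel values.
    visual_feats: list of T_v frames, each a list of C_v channel values.
    Returns T_a frames with C_a + C_v channels, so every latent frame carries
    the visual feature of the video frame at the same time position.
    """
    aligned = align_to_length(visual_feats, len(audio_latent))
    # List concatenation per frame plays the role of channel concatenation.
    return [a + v for a, v in zip(audio_latent, aligned)]


# Toy inputs: 8 latent frames with 2 channels, 4 video frames with 3 channels.
audio_latent = [[float(i)] * 2 for i in range(8)]
visual_feats = [[10.0 * j] * 3 for j in range(4)]
fused = channel_fuse(audio_latent, visual_feats)
```

Because the fusion happens frame by frame and the sequence is never shortened, each output position stays tied to one moment in the video, which is the property the summary credits for strong temporal alignment.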

Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

LoVA: Long-form Video-to-Audio Generation

FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation

FoleyGen: Visually-Guided Audio Generation

STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation

Video-to-Audio Generation with Fine-grained Temporal Semantics

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Efficient Video to Audio Mapper with Visual Scene Detection

HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement

FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching

FlowSep: Language-Queried Sound Separation with Rectified Flow Matching

FAVER: Blind quality prediction of variable frame rate videos

Video-to-Audio Generation with Hidden Alignment

RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement

VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer