Abstract: Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that enhances audio generation from videos by extracting both local temporal and global semantic video features and combining these refined video features with text as cross-modal guidance. To address the issue of information redundancy in videos, we propose an onset prediction pretext task for local temporal feature extraction and an attentive pooling module for global semantic feature extraction. To supplement the insufficient semantic information in videos, we propose a Latent Diffusion Model with Text-to-Audio priors initialization and cross-modal guidance. We also introduce Audio-Audio Align, a new metric to assess audio-temporal alignment. Subjective and objective metrics demonstrate that our method surpasses existing Video-to-Audio models in generating audio with better quality, semantic consistency, and temporal alignment. The ablation experiment validated the effectiveness of each module. Audio samples are available at https://y-ren16.github.io/STAV2A.

What problem does this paper attempt to address?

This paper addresses two key problems in video-to-audio (V2A) generation: semantic consistency and temporal alignment. Existing V2A methods struggle to generate audio that is high-quality, semantically consistent with the video, and temporally aligned with it. These difficulties stem mainly from:

1. **Information Redundancy**: Videos contain a large amount of information, most of it irrelevant to sound, so extracting the audio-related features is difficult.
2. **Lack of Semantic Information**: When audio is generated from video features alone, those features may not carry enough semantic information, and the generated audio may not match the video content.
3. **Temporal Alignment**: Keeping the generated audio synchronized with the events in the video is hard, especially in complex audio-visual scenes.

To solve these problems, the paper proposes the Semantic and Temporal Aligned Video-to-Audio (STA-V2A) model. STA-V2A improves on existing methods in the following ways:

- **Refinement of Local and Global Video Features**: An onset prediction pretext task is introduced to extract local temporal features, and an attentive pooling module is used to extract global semantic features, reducing the interference of redundant information in the video features.
  - **Formula Representation**:
    - The loss function for local temporal feature extraction is:
      \[ L_{\text{onset}} = -\frac{1}{T'} \sum_{i = 1}^{T'} \left[ y_i^a \log(\hat{y}_i^v) + (1 - y_i^a) \log(1 - \hat{y}_i^v) \right] \]
      where \( y_i^a \) is the pseudo-label and \( \hat{y}_i^v \) is the predicted value.
    - The calculation formula for global semantic feature extraction is:
      \[ \tilde{e}_{\text{atten}}^v = \sum_{u = 1}^{L} p(u)\, \tilde{e}(u)^v \]
      where \( p(u) \propto \exp(\alpha_l \theta_l(u) + \alpha_c \theta_c(u)) \), \( \theta_l(u) = v_l^T\, \text{relu}(V_l \tilde{e}(u)^v) \), and
      \[ \theta_c(u) = \sum_{i = 1}^{L} \left( \frac{W_1 \tilde{e}(u)^v}{\| W_1 \tilde{e}(u)^v \|} \right)^T \left( \frac{W_2 \tilde{e}(i)^v}{\| W_2 \tilde{e}(i)^v \|} \right) \]
- **T2A-Enhanced Cross-Modal Latent Diffusion Model**: A pre-trained text-to-audio (T2A) model is used to initialize the diffusion model, and text and video features are combined as cross-modal guidance to ensure the quality, semantic consistency, and temporal alignment of the generated audio.
  - **Formula Representation**:
    - The condition \( c \) of the diffusion model is formed by concatenating the text embedding and the global semantic video feature:
      \[ c = [e_{\text{text}}; e_{\text{gv}}] = [e_1^{\text{text}}, e_2^{\text{text}}, \dots, e_{K_0}^{\text{text}}; e_1^{\text{gv}}, \dots, e_K^{\text{gv}}] \]
    - The diffusion loss function is: \[ L_{\text{DM}}=\mathbb{E}_{z_0, \epsilon\sim\mathcal{N}(0, I), t\sim\text{Uniform}(1, T)} \left\| \epsilon - \hat{
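
The onset loss and the attentive pooling weights described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names `onset_bce_loss` and `attentive_pool` are hypothetical, the `eps` clamp is added here only for numerical safety, and the scores \( \theta_l(u) \) and \( \theta_c(u) \) are assumed to be precomputed and passed in, so the learned projections \( v_l, V_l, W_1, W_2 \) are not modelled.

```python
import math


def onset_bce_loss(pseudo_labels, predictions, eps=1e-7):
    """Mean binary cross-entropy over T' frames.

    pseudo_labels: onset pseudo-labels y_i^a (0 or 1).
    predictions:   onset probabilities y_hat_i^v predicted from video.
    eps:           clamp added in this sketch to avoid log(0).
    """
    if len(pseudo_labels) != len(predictions):
        raise ValueError("labels and predictions must have equal length")
    total = 0.0
    for y, p in zip(pseudo_labels, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(pseudo_labels)


def attentive_pool(features, theta_l, theta_c, alpha_l=1.0, alpha_c=1.0):
    """Weighted sum of per-position features e(u).

    Weights follow p(u) proportional to
    exp(alpha_l * theta_l(u) + alpha_c * theta_c(u)),
    normalised over all positions u (a softmax).
    Returns (pooled_feature, weights).
    """
    logits = [alpha_l * l + alpha_c * c for l, c in zip(theta_l, theta_c)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    # Two frames, both predicted at 0.5: the loss equals log(2).
    print(onset_bce_loss([1, 0], [0.5, 0.5]))
    # Equal scores give uniform weights, so pooling is a plain average.
    print(attentive_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [0.0, 0.0]))
```

Subtracting the largest logit before exponentiating leaves the normalised weights unchanged and avoids overflow when the scores are large.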

STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
