STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

Yong Ren, Chenxing Li, Manjie Xu, Wei Liang, Yu Gu, Rilin Chen, Dong Yu
2024-09-13
Abstract: Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that enhances audio generation from videos by extracting both local temporal and global semantic video features and combining these refined video features with text as cross-modal guidance. To address the issue of information redundancy in videos, we propose an onset prediction pretext task for local temporal feature extraction and an attentive pooling module for global semantic feature extraction. To supplement the insufficient semantic information in videos, we propose a Latent Diffusion Model with Text-to-Audio priors initialization and cross-modal guidance. We also introduce Audio-Audio Align, a new metric to assess audio-temporal alignment. Subjective and objective metrics demonstrate that our method surpasses existing Video-to-Audio models in generating audio with better quality, semantic consistency, and temporal alignment. The ablation experiment validated the effectiveness of each module. Audio samples are available at https://y-ren16.github.io/STAV2A.
Sound, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses two key problems in video-to-audio (V2A) generation: semantic consistency and temporal alignment. Existing V2A methods struggle to generate audio that is high-quality, semantically consistent with the video, and temporally aligned with it. These difficulties mainly stem from:

1. **Information Redundancy**: Videos contain a large amount of information, which makes it difficult to extract the features that are relevant to audio.
2. **Lack of Semantic Information**: When audio is generated from video features alone, the semantic information may be insufficient, so the generated audio does not match the video content.
3. **Temporal Alignment**: Keeping the generated audio synchronized with events in the video is difficult, especially in complex audio-visual scenes.

To solve these problems, the paper proposes the Semantic and Temporal Aligned Video-to-Audio (STA-V2A) model, which improves on existing methods in the following ways:

- **Refinement of Local and Global Video Features**: An onset prediction pretext task is introduced to extract local temporal features, while an attentive pooling module extracts global semantic features, reducing the interference of redundant information in the video features.
  - **Formula Representation**:
  - The loss function for local temporal feature extraction is: \[ L_{\text{onset}} = -\frac{1}{T'} \sum_{i = 1}^{T'} \left[ y_i^a \log(\hat{y}_i^v)+(1 - y_i^a) \log(1 - \hat{y}_i^v) \right] \] where \( y_i^a \) is the pseudo-label and \( \hat{y}_i^v \) is the predicted value.
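The onset loss above is a binary cross-entropy averaged over the \( T' \) frames. A minimal NumPy sketch of that formula follows. The function name `onset_bce_loss`, its argument names, and the `eps` clipping used for numerical safety are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def onset_bce_loss(y_audio, y_video_pred, eps=1e-7):
    """Binary cross-entropy between onset pseudo-labels and predictions.

    y_audio:      pseudo-labels y_i^a in {0, 1}, one per frame (length T').
    y_video_pred: predicted onset probabilities y_hat_i^v in [0, 1].
    eps:          clipping constant (an assumption here) that avoids log(0).
    """
    y = np.asarray(y_audio, dtype=np.float64)
    p = np.asarray(y_video_pred, dtype=np.float64)
    if y.shape != p.shape:
        raise ValueError("labels and predictions must have the same shape")
    # Keep probabilities strictly inside (0, 1) so both logs are finite.
    p = np.clip(p, eps, 1.0 - eps)
    # Negative mean over the T' frames, matching the 1/T' sum in L_onset.
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


if __name__ == "__main__":
    labels = [1, 0, 0, 1]  # hypothetical onset pseudo-labels
    good = [0.9, 0.1, 0.2, 0.8]  # predictions close to the labels
    bad = [0.1, 0.9, 0.8, 0.2]  # predictions far from the labels
    print(onset_bce_loss(labels, good), onset_bce_loss(labels, bad))
```

Predictions that agree with the pseudo-labels give a smaller loss than predictions that contradict them, which is the signal the pretext task would train on.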
  - The global semantic feature is computed by attentive pooling: \[ \tilde{e}_{\text{atten}}^v=\sum_{u = 1}^L p(u) \tilde{e}(u)^v \] where \( p(u)\propto\exp(\alpha_l \theta_l(u)+\alpha_c \theta_c(u)) \), \( \theta_l(u) = v_l^T \text{relu}(V_l \tilde{e}(u)^v) \), and \( \theta_c(u)=\sum_{i = 1}^L \left( \frac{W_1 \tilde{e}(u)^v}{\lVert W_1 \tilde{e}(u)^v \rVert} \right)^T \left( \frac{W_2 \tilde{e}(i)^v}{\lVert W_2 \tilde{e}(i)^v \rVert} \right) \).
- **T2A-Enhanced Cross-Modal Latent Diffusion Model**: A pre-trained text-to-audio (T2A) model initializes the diffusion model, and text and video features are combined as cross-modal guidance so that the generated audio is high-quality, semantically consistent, and temporally aligned.
  - **Formula Representation**:
  - The condition \( c \) of the diffusion model is formed by concatenating the text embedding and the global semantic video feature: \[ c = [e_{\text{text}}; e_{\text{gv}}]=[e_1^{\text{text}}, e_2^{\text{text}}, \dots, e_{K_0}^{\text{text}}; e_1^{\text{gv}}, \dots, e_K^{\text{gv}}] \]
  - The diffusion loss function is: \[ L_{\text{DM}}=\mathbb{E}_{z_0, \epsilon\sim\mathcal{N}(0, I), t\sim\text{Uniform}(1, T)} \left\| \epsilon - \hat{