Diff-BGM: A Diffusion Model for Video Background Music Generation

Sizhe Li, Yiming Qin, Minghang Zheng, Xin Jin, Yang Liu

2024-05-20

Abstract: When editing a video, a piece of attractive background music is indispensable. However, video background music generation tasks face several challenges, for example, the lack of suitable training datasets, and the difficulties in flexibly controlling the music generation process and sequentially aligning the video and music. In this work, we first propose a high-quality music-video dataset BGM909 with detailed annotation and shot detection to provide multi-modal information about the video and music. We then present evaluation metrics to assess music quality, including music diversity and alignment between music and video with retrieval precision metrics. Finally, we propose the Diff-BGM framework to automatically generate the background music for a given video, which uses different signals to control different aspects of the music during the generation process, i.e., uses dynamic video features to control music rhythm and semantic features to control the melody and atmosphere. We propose to align the video and music sequentially by introducing a segment-aware cross-attention layer. Experiments verify the effectiveness of our proposed method. The code and models are available at

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Main Problems Addressed by the Paper

This paper primarily addresses the issue of automatic generation of background music for videos, specifically covering the following aspects:

1. **Construction of High-Quality Dataset**: The paper proposes a new high-quality music video dataset called BGM909. This dataset includes detailed annotation information and carefully selected music-video alignment samples for the task of background music generation.
2. **Control Signals and Interpretability**: The paper addresses the difficulty of intuitively using control signals to adjust different aspects of music (such as melody, rhythm, etc.) in existing methods for background music generation, as well as the lack of good interpretability. It proposes a diffusion model-based method called Diff-BGM, which can control different aspects of music generation through various visual features and improve the interpretability of the generation process.
3. **Temporal Alignment**: For the task of generating background music for videos, it is essential to ensure that the music is temporally aligned with the video, i.e., the music rhythm matches the dynamic changes in the video. The Diff-BGM model introduced in the paper incorporates a segment-aware cross-attention layer to improve the temporal alignment between music and video.
4. **Evaluation Metrics**: To better assess the quality of background music and its consistency with the video, the paper also proposes a series of evaluation metrics, including music quality, diversity, and music retrieval accuracy.

Through the above methods, the paper aims to improve the quality of background music generation, making it more consistent with video content, and enhancing the controllability and interpretability of the model generation process.
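The summary does not spell out how the segment-aware cross-attention layer works, so the following is only a minimal sketch of one plausible reading: music time steps act as queries, video frames act as keys and values, and each music step may attend only to video frames in the same temporal segment. The function names, the equal-width segmentation, and the use of NumPy are assumptions made for illustration; the actual layer in Diff-BGM may differ.

```python
import numpy as np

def segment_ids(length, num_segments):
    """Assign each of `length` time steps to one of `num_segments` equal-width segments."""
    return (np.arange(length) * num_segments) // length

def segment_aware_cross_attention(music, video, num_segments):
    """Hypothetical masked cross-attention (an illustrative assumption, not the paper's code).

    music: (Tm, d) array used as queries.
    video: (Tv, d) array used as keys and values.
    Returns the attended features (Tm, d) and the attention weights (Tm, Tv).
    """
    if len(video) < num_segments:
        raise ValueError("need at least one video frame per segment")
    d = music.shape[1]
    scores = music @ video.T / np.sqrt(d)  # scaled dot-product scores, (Tm, Tv)
    # True where the music step and the video frame fall in the same segment.
    same_segment = (
        segment_ids(len(music), num_segments)[:, None]
        == segment_ids(len(video), num_segments)[None, :]
    )
    scores = np.where(same_segment, scores, -np.inf)  # block cross-segment attention
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ video, weights

rng = np.random.default_rng(0)
music = rng.normal(size=(8, 4))   # 8 music steps, feature size 4
video = rng.normal(size=(16, 4))  # 16 video frames, feature size 4
out, weights = segment_aware_cross_attention(music, video, num_segments=4)
print(out.shape, weights.shape)
```

With 4 segments, the first two music steps can attend only to the first four video frames, which is the kind of local, sequential alignment that the temporal-alignment point above describes.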

Video Background Music Generation: Dataset, Method and Evaluation

Video Background Music Generation with Controllable Music Transformer

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Video Echoed in Harmony: Learning and Sampling Video-Integrated Chord Progression Sequences for Controllable Video Background Music Generation

Dance2Music-Diffusion: leveraging latent diffusion models for music generation from dance videos

Multi-Source Music Generation with Latent Diffusion

A Dataset for Learning Stylistic and Cultural Correlations Between Music and Videos

A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

A System For Automatic Generation Of Music Sports-Video

MV-Diffusion: Motion-aware Video Diffusion Model

VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features

VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

SSVMR: Saliency-Based Self-Training for Video-Music Retrieval

QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing

CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

AutoMatch: A Large-scale Audio Beat Matching Benchmark for Boosting Deep Learning Assistant Video Editing

Dance Any Beat: Blending Beats with Visuals in Dance Video Generation