VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang
2024-09-12
Abstract: We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior datasets used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
Multimedia; Computer Vision and Pattern Recognition; Sound; Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the problem of automatically generating background music from videos. Specifically, the authors propose a framework named VMAS (Video-to-Music Generation via Semantic Alignment in Web Music Videos), which learns from large-scale web video data to generate background music that closely matches the video content. Solving this problem could substantially reduce the time and expertise content creators need to pair videos with background music, and could make videos more engaging to watch.

### Main Contributions

1. **Large-scale Dataset**: The paper introduces a new dataset named DISCO-MV, which contains 2.2 million video-music samples and is orders of magnitude larger than any prior dataset for video-to-music generation. This lets the model learn richer and more diverse musical styles, and therefore generate more realistic and diverse music.
2. **Semantic Alignment Scheme**: A novel video-music alignment scheme combines autoregressive and contrastive learning objectives to encourage the generated music to align with the high-level content of the video (such as video type, style, etc.).
3. **Video-Beat Alignment**: A video-beat alignment scheme matches the generated music beats to the low-level dynamics of the video (such as scene transitions, character actions, etc.).
4. **Efficient Video Encoder**: An efficient video encoder architecture processes a large number of densely sampled video frames, capturing the fine-grained spatio-temporal cues needed to generate fitting background music.

### Technical Details

- **Audio Input Processing**: EnCodec converts the continuous audio stream into a sequence of discrete audio tokens, which serve as the supervision signal for the video-to-music generation model.
- **Video Input Processing**: The video input is uniformly sampled and aligned with the corresponding audio segments. The video encoder is a modified version of the Hiera architecture, adapted to process high-frame-rate videos efficiently.
- **Autoregressive Music Generation**: A standard Transformer architecture serves as the autoregressive music decoder, generating music conditioned on video features.
- **Semantic Video-Music Alignment**: A global video-music contrastive objective and the video-beat alignment scheme together encourage close alignment between the generated music and the video content.
- **Training Objectives**: The contrastive learning and autoregressive generation objectives are combined, with a balance term \(\beta\) controlling their relative importance.

### Experimental Results

- **Performance Evaluation**: On the MusicCaps and DISCO-MV datasets, VMAS outperforms existing methods on multiple evaluation metrics (such as FAD, KL, music-video alignment, etc.).
- **Human Evaluation**: In the human evaluation, participants rated the music generated by VMAS above that of other methods in overall quality and music-video synchronization.

In conclusion, the paper improves the quality and diversity of background music generated automatically from videos by introducing a large-scale dataset and new alignment schemes.
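
To make the training objective described under Technical Details more concrete, here is a minimal sketch in plain Python of how a next-token (autoregressive) loss over audio tokens might be combined with a global video-music contrastive loss through \(\beta\). Several things here are assumptions rather than details from the paper: the function names, the symmetric InfoNCE form of the contrastive term, the `temperature` value, the choice to apply \(\beta\) to the contrastive term, and the simplification of the EnCodec output to a single token stream.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def autoregressive_loss(step_logits, target_tokens):
    """Mean cross-entropy of the target audio token at each decoding step."""
    total = 0.0
    for logits, token in zip(step_logits, target_tokens):
        total -= log_softmax(logits)[token]
    return total / len(target_tokens)


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
    return [x / norm for x in vec]


def contrastive_loss(video_embs, music_embs, temperature=0.07):
    """Symmetric InfoNCE; pair i in video_embs matches pair i in music_embs.

    Assumption: the summary only mentions a global video-music contrastive
    objective; this particular form and temperature are illustrative.
    """
    v = [_normalize(e) for e in video_embs]
    m = [_normalize(e) for e in music_embs]
    n = len(v)
    sim = [[sum(a * b for a, b in zip(v[i], m[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Video-to-music direction: softmax over each row.
    v2m = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Music-to-video direction: softmax over each column.
    m2v = -sum(log_softmax([sim[i][j] for i in range(n)])[j]
               for j in range(n)) / n
    return 0.5 * (v2m + m2v)


def joint_loss(step_logits, target_tokens, video_embs, music_embs, beta=0.5):
    """Assumption: beta scales the contrastive term relative to the AR term."""
    ar = autoregressive_loss(step_logits, target_tokens)
    con = contrastive_loss(video_embs, music_embs)
    return ar + beta * con
```

In this sketch, setting `beta=0` reduces training to pure next-token prediction, while larger values push the pooled video and music embeddings of matching pairs closer together relative to mismatched pairs in the batch.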