MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding Space Binding

Jiajie Teng, Huiyu Duan, Yucheng Zhu, Sijing Wu, Guangtao Zhai

2024-05-15

Abstract: Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos and can significantly influence the emotions of viewers. At present, however, the background music of short videos is generally chosen by the video producer, and there is a lack of automatic music recommendation methods for short videos. This paper introduces MVBind, a Music-Video embedding space Binding model for cross-modal retrieval. MVBind is a self-supervised approach that learns intermodal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a corresponding music-visual pair dataset for short videos, we construct a dataset, SVM-10K (Short Video with Music-10K), which mainly consists of meticulously selected short videos. On this dataset, MVBind achieves significantly better performance than other baseline methods. The constructed dataset and code will be released to facilitate future research.

Multimedia, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the automatic recommendation of background music for short videos. Background music shapes the emotional expression of a short video and the way the audience understands and experiences it. Currently, however, the music is mainly selected by the video producer, and effective automatic recommendation methods are lacking. Manual selection consumes a great deal of the producer's time and energy, and it becomes difficult to choose appropriate music for each clip efficiently when the music and video collections are large. A music recommendation system that improves the production efficiency of short videos is therefore valuable.

The paper proposes MVBind, a self-supervised Music-Video embedding space Binding model for cross-modal retrieval between the music and video modalities. MVBind learns the relationships between the modalities directly from data, without manual annotation. To make up for the lack of paired music-visual datasets, the researchers also constructed SVM-10K, a dataset containing nearly 10,000 carefully selected short videos. According to the experimental results, MVBind shows a significant performance improvement over other baseline methods on SVM-10K.
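To make the idea of retrieval in a bound embedding space concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a video encoder and a music encoder have already mapped their inputs into the same vector space, and it assumes cosine similarity as the ranking score; the summary above specifies neither the encoders nor the similarity measure. The function names and the toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_music_for_video(video_emb, music_embs, top_k=3):
    """Return indices of the top_k music tracks most similar to the video.

    Assumption: both embeddings live in one shared space, and cosine
    similarity is used as the score (not confirmed by the source).
    """
    v = l2_normalize(video_emb)
    scores = []
    for idx, m in enumerate(music_embs):
        m_unit = l2_normalize(m)
        scores.append((sum(a * b for a, b in zip(v, m_unit)), idx))
    # Sort by descending similarity; ties are broken by the lower index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:top_k]]


# Toy embeddings (hypothetical values, not taken from the paper).
video = [1.0, 0.0, 0.0]
library = [
    [0.0, 1.0, 0.0],   # orthogonal to the video embedding
    [0.9, 0.1, 0.0],   # nearly aligned with the video embedding
    [-1.0, 0.0, 0.0],  # opposite direction
]
ranking = rank_music_for_video(video, library, top_k=2)
print(ranking)
```

In this toy case the nearly aligned track is ranked first and the orthogonal track second. In a real system the embeddings would come from trained encoders, and the candidate library would typically be indexed for fast nearest-neighbour search rather than scanned linearly.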
