Abstract: An automated process that can suggest a soundtrack for a user-generated video (UGV) and turn the UGV into a music-compliant, professional-like video is challenging but desirable. To this end, this paper presents an automatic music video (MV) generation system that conducts soundtrack recommendation and video editing simultaneously. First, a long UGV is divided into a sequence of fixed-length short segments (e.g., 2 seconds), and a multi-task deep neural network (MDNN) is applied to predict the pseudo acoustic (music) features (also called the pseudo song) from the visual (video) features of each video segment. In this way, the distance between any pair of video and music segments of the same length can be computed in the music feature space. Second, the sequence of pseudo acoustic (music) features of the UGV and the sequence of acoustic (music) features of each music track in the music collection are temporally aligned by the dynamic time warping (DTW) algorithm with a pseudo-song-based deep similarity matching (PDSM) metric. Third, for each music track, the video editing module selects and concatenates segments of the UGV according to the DTW-aligned result, using target and concatenation costs given by a pseudo-song-based deep concatenation cost (PDCC) metric, to generate a music-compliant, professional-like video. Finally, all the generated MVs are ranked, and the best MV is recommended to the user. The MDNN for pseudo song prediction and the PDSM and PDCC metrics are trained on an annotated official music video (OMV) corpus. The results of objective and subjective experiments demonstrate that the proposed system performs well and can generate appealing MVs with better viewing and listening experiences.
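The temporal alignment step can be illustrated with a minimal DTW sketch. This is not the paper's implementation: the learned PDSM metric is replaced here by a plain Euclidean distance, and the names `dtw_align`, `euclidean`, `video_feats`, and `music_feats` are hypothetical. Each input is assumed to be a list of per-segment feature vectors already expressed in the same (music) feature space.

```python
import math


def euclidean(a, b):
    """Stand-in distance; the paper uses a learned PDSM metric instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(video_feats, music_feats, dist=euclidean):
    """Align two feature sequences with classic dynamic time warping.

    Returns (total_cost, path), where path is a list of
    (video_index, music_index) pairs from start to end.
    """
    n, m = len(video_feats), len(music_feats)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning the first i video
    # segments with the first j music segments.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(video_feats[i - 1], music_feats[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance video only
                cost[i][j - 1],      # advance music only
            )

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy example: 3 pseudo-song vectors vs. 4 music vectors (1-D features).
video_feats = [[0.0], [1.0], [2.0]]
music_feats = [[0.0], [1.0], [1.0], [2.0]]
total_cost, path = dtw_align(video_feats, music_feats)
```

In this toy case the second video segment is matched to two consecutive music segments, which is the kind of many-to-one correspondence the editing stage would then resolve when selecting and concatenating UGV segments.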

Video Echoed in Harmony: Learning and Sampling Video-Integrated Chord Progression Sequences for Controllable Video Background Music Generation

audeosynth: music-driven video montage

VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features

Video Background Music Generation: Dataset, Method and Evaluation

V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Foley Music: Learning to Generate Music from Videos

Video Background Music Generation with Controllable Music Transformer

SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Music Conditioned Generation for Human-Centric Video

Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

A Music-Driven System for Generating Apparel Display Video

Harmonizing Pixels and Melodies: Maestro-Guided Film Score Generation and Composition Style Transfer

Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing

Video-to-Audio Generation with Hidden Alignment

Diff-BGM: A Diffusion Model for Video Background Music Generation

Serenade: A Model for Human-in-the-loop Automatic Chord Estimation

Audeo: Audio Generation for a Silent Performance Video

Visual to Sound: Generating Natural Sound for Videos in the Wild