Abstract: Trailer generation is a challenging video clipping task that aims to select highlighting shots from long videos like movies and re-organize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate the trailer generation task as selecting and sorting key movie shots based on audio shots, which involves matching the latent representations across visual and acoustic modalities. We learn a multi-modal latent representation model in the proposed IPOT framework to achieve this aim. In this framework, a two-tower encoder derives the latent representations of movie and music shots, respectively, and an attention-assisted Sinkhorn matching network parameterizes the grounding distance between the shots' latent representations and the distribution of the movie shots. Taking the correspondence between the movie shots and its trailer music shots as the observed optimal transport plan defined on the grounding distances, we learn the model by solving an inverse partial optimal transport problem, leading to a bi-level optimization strategy. We collect real-world movies and their trailers to construct a dataset with abundant label information called CMTD and, accordingly, train and evaluate various automatic trailer generators. Compared with state-of-the-art methods, our IPOT method consistently shows superiority in subjective visual effects and objective quantitative measurements.

What problem does this paper attempt to address?

The paper addresses the problem of automatically generating movie trailers that match the background music. Specifically, the authors propose a method based on the Inverse Partial Optimal Transport (IPOT) framework that selects and reorganizes key shots from long videos into attractive movie trailers. The process must account not only for the selection of visual content but also for the semantic consistency and synchronization between the visual content and the background music.

### Main Challenges

1. **Semantic Consistency**: Existing methods that use background music mainly synchronize movie clips to the music rhythm and neglect the semantic alignment between visual and auditory information.
2. **Difficulty in Data Annotation**: Current learning methods often rely on detailed annotations, such as frame-level gaze scores and manually defined emotion labels, which are time-consuming and costly to obtain.
3. **Insufficient Datasets**: Current movie trailer datasets do not include multiple official trailers for the same movie and lack fine-grained annotations, which leaves models prone to overfitting.

### Solutions

1. **Inverse Partial Optimal Transport Framework (IPOT)**:
   - **Multimodal Representation Learning**: Extract latent representations of movie clips and music clips through encoders with a two-tower structure.
   - **Cross-modal Alignment**: Use attention mechanisms and Sinkhorn matching networks to align the visual and auditory latent representations.
   - **Selection and Ranking**: Learn the distribution of movie clips and the cross-modal optimal transport plan, then use them to select and rank key movie clips for the trailer.
2. **Constructing a New Dataset (CMTD)**:
   - **Rich Annotation Information**: Includes multiple official trailers for the same movie, with fine-grained segment information and metadata such as subtitles, plot summaries, and turning point annotations.
   - **Large-scale Data**: Possibly the largest movie trailer dataset with detailed annotations to date, which helps improve the model's generalization ability.

### Contributions

1. **Proposed a novel and effective IPOT framework** for music-guided movie trailer generation, addressing the shortcomings of existing methods in semantic alignment and data annotation.
2. **Constructed a new public comprehensive movie trailer dataset CMTD**, providing rich resources for movie trailer generation and other video understanding tasks.

Through these innovations, this research aims to improve the quality and efficiency of automatic movie trailer generation and bring it closer to practical application needs.
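To make the "selection and ranking" step more concrete, the following is a minimal NumPy sketch of how a partial optimal transport plan between movie shots and music shots could be computed with Sinkhorn iterations and then read off as an ordered shot list. It is not the authors' implementation: the fixed cosine cost, the uniform shot capacities, the zero-cost dummy column used to handle partial matching, and the names `partial_sinkhorn_plan`, `eps`, and `n_iters` are all assumptions made for illustration. In the paper, the distance and the movie-shot distribution are learned, and the inverse (bi-level) training step is not shown here.

```python
import numpy as np

def partial_sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic partial OT via a dummy column (illustrative, hypothetical helper).

    cost: (N, M) array of distances between N movie shots and M music shots.
    Each music shot must receive mass 1/M; each movie shot may give at most 1/M.
    """
    n, m = cost.shape
    assert n >= m, "need at least as many movie shots as music shots"
    # Assumption: a zero-cost dummy column absorbs the mass of unselected
    # movie shots, turning the partial problem into a balanced one.
    cost_ext = np.hstack([cost, np.zeros((n, 1))])
    a = np.full(n, 1.0 / m)                           # movie-shot capacities
    b = np.append(np.full(m, 1.0 / m), (n - m) / m)   # music masses + dummy mass
    kernel = np.exp(-cost_ext / eps)
    u = np.ones(n)
    v = np.ones(m + 1)
    for _ in range(n_iters):
        # Alternate scaling so rows match a, then columns match b.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return plan[:, :m]                                # drop the dummy column

# Toy data: 6 movie shots, 3 music shots, unit-norm embeddings.
movie = np.eye(6)
music = movie[[4, 1, 3]]
cost = 1.0 - movie @ music.T                          # cosine distance

plan = partial_sinkhorn_plan(cost)
# For each music shot (in playback order), pick the best-matching movie shot.
# Note: a plain argmax may repeat a shot on harder inputs.
order = plan.argmax(axis=0).tolist()
print(order)
```

In this toy case every music shot has an exact counterpart among the movie shots, so the recovered order should coincide with the indices used to build `music`. The dummy column is one common way to reduce a partial problem to a balanced one; other formulations of partial transport would work equally well for a sketch like this.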

An Inverse Partial Optimal Transport Framework for Music-guided Movie Trailer Generation
