An Unsupervised Video Summarization Method Based on Multimodal Representation.

Zhuo Lei, Qiang Yu, Lidan Shou, Shengquan Li, Yunqing Mao
DOI: https://doi.org/10.1007/978-981-99-4761-4_15
2023-01-01
Abstract: A good video summary should convey the whole story and feature the most important content. However, the importance of video content is often subjective, and users should have the option to personalize the summary by using natural language to specify what is important to them. Moreover, existing methods usually rely only on visual cues and address only generic video summarization, whereas this work introduces a single unsupervised multi-modal framework that addresses both generic and query-focused video summarization. We use a multi-head attention model to represent the multi-modal features. We apply a Transformer-based model to learn frame scores, training it with representative, diversity, and reconstruction losses. In particular, we develop a novel representative loss that trains the model on both visual and semantic information. Our method outperforms previous state-of-the-art work on both generic and query-focused video summarization datasets.
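To make the idea of a diversity loss over frame scores concrete, here is a minimal sketch in plain Python. It is not the formulation from the paper, which the abstract does not give. It assumes one common choice: the score-weighted mean pairwise cosine similarity between frame features, so that giving high scores to near-duplicate frames yields a high loss. The names `cosine_similarity` and `diversity_loss`, and the toy feature vectors, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def diversity_loss(features, scores):
    """Score-weighted mean pairwise similarity (assumed form, not the paper's).

    features: list of per-frame feature vectors
    scores:   list of per-frame importance scores in [0, 1]
    Lower values mean the highly scored frames are more diverse.
    """
    weighted_sim = 0.0
    total_weight = 0.0
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            # A pair only contributes if both frames have non-zero scores.
            weight = scores[i] * scores[j]
            weighted_sim += weight * cosine_similarity(features[i], features[j])
            total_weight += weight
    return weighted_sim / total_weight if total_weight > 0 else 0.0


# Toy example: frames 0 and 1 are identical, frame 2 is orthogonal to both.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
redundant = diversity_loss(frames, [1.0, 1.0, 0.0])  # selects the duplicates
diverse = diversity_loss(frames, [1.0, 0.0, 1.0])    # selects distinct frames
```

In this toy case the redundant selection is penalized more than the diverse one. In a trained model, a term like this would be combined with the representative and reconstruction terms; how those are weighted is not stated in the abstract.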