Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video

Tomoya Sugihara, Shuntaro Masuda, Ling Xiao, Toshihiko Yamasaki
2024-08-20
Abstract: Current video summarization methods rely heavily on supervised computer vision techniques, which demand time-consuming and subjective manual annotations. To overcome these limitations, we investigated self-supervised video summarization. Inspired by the success of Large Language Models (LLMs), we explored the feasibility of transforming the video summarization task into a Natural Language Processing (NLP) task. By leveraging the advantages of LLMs in context understanding, we aim to enhance the effectiveness of self-supervised video summarization. Our method begins by generating captions for individual video frames, which are then synthesized into text summaries by LLMs. Subsequently, we measure the semantic distance between the captions and the text summary. Notably, we propose a novel loss function to optimize our model according to the diversity of the video. Finally, the summarized video can be generated by selecting the frames with captions similar to the text summary. Our method achieves state-of-the-art performance on the SumMe dataset in rank correlation coefficients. In addition, our method has a novel feature of being able to achieve personalized summarization.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper targets a core limitation of current video summarization methods: they rely heavily on supervised computer vision techniques, which require time-consuming and subjective manual annotations. To overcome this, the authors investigate self-supervised video summarization and, inspired by the success of large language models (LLMs), explore recasting the task as a natural language processing (NLP) task. The method proceeds in four steps:

1. **Generate Descriptions**: A pre-trained image captioning model (such as Generative Image-to-Text, GIT) generates a descriptive caption for each video frame.
2. **Text Summary Synthesis**: GPT-4 synthesizes these captions into a coherent text summary.
3. **Semantic Distance Measurement**: The semantic distance between each frame's caption and the text summary is calculated. A new loss function, Preserving Diversity Loss (PDL), is proposed to optimize the model according to the video's diversity.
4. **Video Summary Generation**: Frames whose captions are highly similar to the text summary are selected to form the video summary.

The method achieves state-of-the-art rank correlation coefficients on the SumMe dataset. It also supports personalized summarization, generating video summaries tailored to user requirements. In this way, the paper reduces the dependence on large amounts of manually annotated data that traditional supervised video summarization methods require, while aiming to improve the quality and efficiency of video summaries.
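
The final selection step, choosing frames whose captions are close to the text summary, can be illustrated with a minimal sketch. This is not the authors' implementation: the paper trains a model with its Preserving Diversity Loss to measure semantic distance, whereas the sketch below swaps in a plain bag-of-words cosine similarity as a crude stand-in. The function names (`bag_of_words`, `score_frames`, `select_frames`), the example captions, and the summary text are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts: a crude stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_frames(captions, summary):
    """Score each frame by how similar its caption is to the text summary."""
    summary_vec = bag_of_words(summary)
    return [cosine_similarity(bag_of_words(c), summary_vec) for c in captions]


def select_frames(scores, k):
    """Return the indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-frame captions (step 1) and LLM text summary (step 2).
captions = [
    "a man rides a bike down a hill",
    "a close up of a wall",
    "a man falls off a bike",
    "a parking lot with cars",
]
summary = "A man rides a bike down a hill and falls off."

scores = score_frames(captions, summary)
selected = select_frames(scores, k=2)
print(selected)  # [0, 2]
```

Under this stand-in similarity, the two bike-related frames score highest and are kept. Replacing the summary with text that emphasizes a particular interest would shift the scores toward different frames, which is the intuition behind the personalized summarization mentioned above. A word-count similarity is sensitive to common words such as "a", so a real system would need a proper semantic text representation.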