Unsupervised video summarization with adversarial graph-based attention network

Jeshmitha Gunuganti, Zhi-Ting Yeh, Jenq-Haur Wang, Mehdi Norouzi
DOI: https://doi.org/10.1016/j.jvcir.2024.104200
IF: 2.887
2024-06-16
Journal of Visual Communication and Image Representation
Abstract: Video summarization aims to select a subset of video segments that best capture the video storyline. Our study seeks to train an encoder that transforms the raw frame features extracted from pre-trained CNN models into representations that encode frame importance and guide the selection of video segments. Our main idea is to use graph modeling and attention mechanisms to train the encoder adversarially. The graph representation enables the model to learn the relationships among frames, revealing the intrinsic structure of a video. The attention mechanism allows the model to capture the magnitude of these relationships. In the proposed model, an attention-based encoder is trained together with a graph-based generator, which reconstructs videos from the encoded features, and a discriminator, which guides the generator by distinguishing original videos from reconstructed ones. By leveraging graph attention and refining mechanisms, the proposed model offers distinct advantages over existing methods, including enhanced summarization accuracy, improved preservation of temporal coherence, and the ability to capture complex semantic linkages within video content. These advantages are substantiated through a comprehensive ablation study, which demonstrates the efficacy of our model using the Kendall and Spearman rank correlation coefficients as evaluation metrics. Evaluated on the TVSum and SumMe datasets, the proposed model achieves results on par with supervised models that use similar encoders, and state-of-the-art results compared to other unsupervised models.
computer science, information systems, software engineering
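The Kendall and Spearman coefficients named in the abstract measure how well a ranking of frames by predicted importance agrees with a ranking by human annotation. The following is a minimal sketch of both coefficients in plain Python, not the paper's evaluation code: the function names and the toy scores are hypothetical, and the Kendall variant shown is tau-a (no tie correction), which may differ from the variant used in the paper. Library implementations such as SciPy's default to tau-b.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 is a tie; tau-a counts it as neither.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-frame importance scores (illustrative values only).
predicted = [0.1, 0.4, 0.35, 0.8, 0.6]
human = [0.2, 0.3, 0.5, 0.9, 0.7]

rho = spearman_rho(predicted, human)
tau = kendall_tau(predicted, human)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```

Both coefficients range from -1 (reversed ranking) to 1 (identical ranking), so they reward a model for ordering frames correctly even when its raw scores are on a different scale from the annotations.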