A Knowledge Augmented and Multimodal-Based Framework for Video Summarization

Jiehang Xie, Xuanbai Chen, Shao-Ping Lu, Yulu Yang
DOI: https://doi.org/10.1145/3503161.3548089
2022-01-01
Abstract: Video summarization aims to generate a compact version of a lengthy video that retains its primary content. Humans are generally good at producing high-quality video summaries because they take in crucial content through multiple channels of information and have abundant background knowledge about the original video. Existing methods, however, rarely consider multichannel information and ignore the impact of external knowledge, which limits the quality of the generated summaries. This paper proposes a knowledge-augmented, multimodal video summarization method, termed KAMV, to address this problem. Specifically, we design a knowledge encoder that combines generation and retrieval to capture descriptive content and latent connections between events and entities from an external knowledge base, providing rich implicit knowledge for better comprehension of the video. Furthermore, to explore the interactions among visual, audio, and implicit-knowledge information and to emphasize the content most relevant to the desired summary, we present a fusion module supervised by this multimodal information. Extensive experiments on four public datasets demonstrate that KAMV outperforms state-of-the-art video summarization approaches.