Abstract: Video summarization is increasingly important for the rapid browsing, retrieval, and comprehension of large video collections. Humans can condense a lengthy video into a compact and reasonable summary because they have rich prior knowledge about the raw video and can use multimodal information to filter out less important frames. Existing automated video summarization approaches, however, struggle to identify the significant shots in a video robustly, which degrades the quality of the generated summaries. Drawing inspiration from these human abilities, we propose a novel video summarization approach based on a knowledge-aware multimodal network (KAMN). In particular, we present a knowledge-based encoder that produces a representation for each frame. This representation consists of descriptive content and affective information retrieved from large-scale external knowledge bases, which supply rich implicit knowledge for understanding the video. Moreover, to integrate the visual, audio, and implicit knowledge features effectively and to identify valuable information across modalities, we design a fusion module that learns the relationships among these multimodal features. KAMN can be trained in both unsupervised and supervised modes. We conducted objective quantitative experiments and subjective user studies on four publicly available datasets. The results verify the effectiveness of the proposed modules and show that our framework achieves superior performance.

DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization

Learning User Interest with Improved Triplet Deep Ranking and Web-Image Priors for Topic-Related Video Summarization

Creating Memorable Video Summaries That Satisfy the User's Intention for Taking the Videos

An Unsupervised Video Summarization Method Based on Multimodal Representation

Hierarchical organization for medical video summarization using latent visual and semantic analysis

Memorable and Rich Video Summarization

Query-Aware Sparse Coding for Multi-Video Summarization

Deep Semantic and Attentive Network for Unsupervised Video Summarization

Convolutional Hierarchical Attention Network for Query-Focused Video Summarization

Query-Oriented Micro-Video Summarization

Query-Biased Self-Attentive Network for Query-Focused Video Summarization

Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach

Hierarchical multi‐modal video summarization with dynamic sampling

Personalized Video Summarization by Multimodal Video Understanding

Learning Multiscale Hierarchical Attention for Video Summarization

Video summarization via knowledge-aware multimodal deep networks

Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

Multi-View Video Summarization

Spatial Attention Model‐modulated Bi‐directional Long Short‐term Memory for Unsupervised Video Summarisation

Effective Video Summarization Using Channel Attention-Assisted Encoder–Decoder Framework

MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization