Abstract:Video summarization has unprecedented importance in facilitating the rapid browsing, retrieval, and comprehension of large numbers of videos. Benefiting from possessing rich prior knowledge of the raw video and the capability to filter less crucial frames by employing multimodal information, humans can condense a lengthy video into a compact and reasonable video summary. However, existing automated video summarization approaches struggle to determine which shots in a video are significant concurrently and robustly, which is detrimental to the generation of high-quality summaries. To improve the quality of video summaries further, drawing inspiration from human abilities, we propose a novel video summarization approach based on a knowledge-aware multimodal network (KAMN). In particular, we present a knowledge-based encoder to obtain the corresponding representation for each frame. This representation is composed of captured descriptive content and affections, which are retrieved from large-scale external knowledge bases. Owing to these knowledge bases, rich implicit knowledge is provided to better understand the viewed video. Moreover, to integrate the visual, audio, and implicit knowledge features more effectively and to identify valuable information across different modalities further, we design a fusion module to learn these multimodal feature relationships more thoroughly. KAMN operates in both unsupervised and supervised training modes. Objective quantitative experiments and subjective user studies were conducted using four publicly available datasets. The results verified the effectiveness of the proposed modules and demonstrated the superior performance yielded by our framework.

Hierarchical multi‐modal video summarization with dynamic sampling

A Novel Compact Yet Rich Key Frame Creation Method for Compressed Video Summarization

A Human-Machine Collaborative Video Summarization Framework Using Pupillary Response Signals

Memorable and Rich Video Summarization

Creating Memorable Video Summaries That Satisfy the User's Intention for Taking the Videos.

Creating Personalized Video Summaries Via Semantic Event Detection

An Unsupervised Video Summarization Method Based on Multimodal Representation.

Hierarchical organization for medical video summarization using latent visual and semantic analysis

MHMS: Multimodal Hierarchical Multimedia Summarization

MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization

Learning Multiscale Hierarchical Attention for Video Summarization

Automatically Generating Hierarchical Summary for Film Video

Multi-View Video Summarization

Hierarchical multimodal transformer to summarize videos

Multi-video Summarization Using Complex Graph Clustering and Mining.

Video summarization via knowledge-aware multimodal deep networks

DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization

From Coarse to Fine: Hierarchical Structure-aware Video Summarization

Hierarchical3D Adapters for Long Video-to-text Summarization

Perceptual Attributes Optimization For Multivideo Summarization

An Effective Video Summarization Framework Toward Handheld Devices