Abstract: Video summarization facilitates rapid browsing and efficient video indexing in many video browsing website applications, such as sports video highlights and dynamic video covers. In these applications, it is most important to generate video summaries that capture the interesting content that users prefer. While many existing methods generate video summaries based on low-level features, this paper first mines large-scale Flickr images to find "interest" and "non-interest" images for the same query, in order to learn what is of interest to users. Unlike existing pairwise ranking-based methods for video summarization, we then propose an improved triplet deep ranking model that converges more easily, learns the relationship between "interest" and "non-interest" Flickr images, and identifies which visual content of the original video users actually prefer. In the training process, triplets (interest image p+, interest image p'+, non-interest image p'') are selected as input to train a model with three parallel deep convolutional networks. In the video summarization process, an efficient entropy-based video segmentation method divides the original video into segments, and the visual interest score of each segment is estimated using the trained ranking network for summarization (SumNet). Then, an optimal subset of the segments is selected to create a summary capturing interesting visual content. We evaluate our method against several state-of-the-art methods; experimental results show that it improves on the best baseline method by 9.6% in terms of mean Average Precision (mAP).
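The two core steps described in the abstract, triplet-based ranking and selection of a subset of scored segments, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the loss formula of the improved triplet model or the selection algorithm, so the sketch uses a standard hinge-based triplet loss on embedding vectors (treating the first interest image as the anchor) and a 0/1 knapsack over segment lengths. The function names, the margin, and the length budget are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_hinge_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Standard triplet hinge loss (an assumption, not the paper's improved variant).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, margin + gap)


def select_segments(
    lengths: Sequence[int], scores: Sequence[float], budget: int
) -> List[int]:
    """Pick segment indices maximizing total score with total length <= budget.

    Solved as a 0/1 knapsack by dynamic programming (assumed selection rule).
    """
    n = len(lengths)
    # best[i][cap]: best total score using the first i segments within capacity cap.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for cap in range(budget + 1):
            best[i][cap] = best[i - 1][cap]
            if length <= cap:
                candidate = best[i - 1][cap - length] + score
                if candidate > best[i][cap]:
                    best[i][cap] = candidate

    # Backtrack to recover which segments were taken.
    chosen: List[int] = []
    cap = budget
    for i in range(n, 0, -1):
        if best[i][cap] != best[i - 1][cap]:
            chosen.append(i - 1)
            cap -= lengths[i - 1]
    return sorted(chosen)


if __name__ == "__main__":
    # Hypothetical segment lengths (in seconds) and interest scores.
    print(select_segments([4, 3, 2, 5], [0.9, 0.4, 0.6, 0.7], budget=7))
    print(triplet_hinge_loss([0.0, 0.0], [1.0, 0.0], [3.0, 0.0]))
```

In this sketch the interest scores passed to `select_segments` stand in for the output of a trained ranking network; in practice they would come from the model rather than being supplied by hand.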

Learning Multiscale Hierarchical Attention for Video Summarization

A Human-Machine Collaborative Video Summarization Framework Using Pupillary Response Signals

Learning User Interest with Improved Triplet Deep Ranking and Web-Image Priors for Topic-Related Video Summarization.

An Unsupervised Video Summarization Method Based on Multimodal Representation.

Creating Memorable Video Summaries That Satisfy the User's Intention for Taking the Videos.

Hierarchical organization for medical video summarization using latent visual and semantic analysis

A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning

A GAN Based Video Summarization Method with Representation Loss

MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization

Supervised Video Summarization via Multiple Feature Sets with Parallel Attention

Convolutional Hierarchical Attention Network for Query-Focused Video Summarization.

Deep Semantic and Attentive Network for Unsupervised Video Summarization

Spatial Attention Model‐modulated Bi‐directional Long Short‐term Memory for Unsupervised Video Summarisation

Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network

Spatiotemporal Two-Stream LSTM Network for Unsupervised Video Summarization

Deep Attentive Video Summarization with Distribution Consistency Learning

Hierarchical multi‐modal video summarization with dynamic sampling

Graph Attention Networks Adjusted Bi-LSTM for Video Summarization

From Coarse to Fine: Hierarchical Structure-aware Video Summarization

CSTA: CNN-based Spatiotemporal Attention for Video Summarization

Exploring global diverse attention via pairwise temporal relation for video summarization