Abstract:Video summarization (VS) is generally formulated as a subset selection problem where a set of representative keyframes or key segments is selected from an entire video frame set. Though many sparse subset selection based VS algorithms have been proposed in the past decade, most of them adopt linear sparse formulation in the explicit feature vector space of video frames, and don't consider the local or global relationships among frames. In this paper, we first extend the conventional sparse subset selection for VS into kernel block sparse subset selection (KBS3) to utilize the advantage of kernel sparse coding and introduce a local inter-frame relationship through packing of frame blocks. Going a step further, we propose a similarity based block sparse subset selection (SB2S3) model by applying a specially designed transformation matrix on the KBS3 model in order to introduce a kind of global inter-frame relationship through the similarity. Finally, a greedy pursuit based algorithm is devised for the proposed NP-hard model optimization. The proposed SB2S3 has the following advantages: 1) through the similarity between each frame and any other frame, the global relationship among all frames can be considered; 2) through block sparse coding, the local relationship of adjacent frames is further considered; and 3) it has a wider application, since features can derive similarity, but not vice versa. It is believed that the effect of modeling such global and local relationships among frames in this paper, is similar to that of modeling the long-range and short-range dependencies among frames in deep learning based methods. Experimental results on three benchmark datasets have demonstrated that the proposed approach is superior to not only other sparse subset selection based VS methods but also most unsupervised deep-learning based VS methods.

Spatiotemporal Modeling for Video Summarization Using Convolutional Recurrent Neural Network

A GAN Based Video Summarization Method with Representation Loss

Transformer-based Video Summarization with Spatial-Temporal Representation

An Unsupervised Video Summarization Method Based on Multimodal Representation.

A Video Summarization Model Based on Deep Reinforcement Learning with Long-Term Dependency

VIDEO SUMMARIZATION VIA TEMPORAL COLLABORATIVE REPRESENTATION OF ADJACENT FRAMES

Spatiotemporal Two-Stream LSTM Network for Unsupervised Video Summarization

Video Summarization with a Dual-Path Attentive Network

Multi-Level Spatiotemporal Network for Video Summarization

TTH-RNN: Tensor-Train Hierarchical Recurrent Neural Network for Video Summarization

Relational Reasoning over Spatial-Temporal Graphs for Video Summarization

Category Driven Deep Recurrent Neural Network for Video Summarization

Video Summarization Via Weighted Neighborhood Based Representation.

Video Summarization Via Simultaneous Block Sparse Representation.

Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network

VSS-Net: Visual Semantic Self-Mining Network for Video Summarization

Patch Based Video Summarization with Block Sparse Representation.

Video Summarization via Semantic Attended Networks.

Similarity Based Block Sparse Subset Selection for Video Summarization.

A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning

Deep Semantic and Attentive Network for Unsupervised Video Summarization