Abstract: Video summarization, which aims to detect valuable segments in untrimmed videos, is a meaningful yet understudied topic. Previous methods primarily consider inter-frame and inter-shot temporal dependencies, which may be insufficient to pinpoint important content because the valuable information that can be learned from them is limited. To address this limitation, we propose the Visual Semantic Self-mining Network (VSS-Net), a novel summarization framework motivated by the widespread success of cross-modality learning tasks. VSS-Net first adopts a two-stream structure consisting of a Context Representation Graph (CRG) and a Video Semantics Encoder (VSE), which are jointly exploited to lay the groundwork for stronger content awareness. Specifically, CRG is constructed using an edge-set strategy tailored to the hierarchical structure of videos, enriching visual features with local and non-local temporal cues from the perspectives of temporal order and visual relationship. Meanwhile, by learning visual similarity across features, VSE adaptively acquires an instructive video-level semantic representation of the input video in a coarse-to-fine manner. The two streams then converge in a Context-Semantics Interaction Layer (CSIL), which exchanges information between frame-level temporal cues and the video-level semantic representation, yielding informative representations and increasing sensitivity to important segments. Finally, a prediction head predicts importance scores, which are followed by key shot selection. We evaluate the proposed framework on widely used benchmarks and demonstrate its effectiveness and superiority over state-of-the-art methods.
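The abstract does not specify how key shots are selected from the predicted importance scores. As an illustration only, the sketch below assumes the protocol that is common in video summarization benchmarks: frame-level scores are averaged per shot, and shots are chosen by a 0/1 knapsack under a summary-length budget. The function names, the `[start, end)` shot format, and the default `budget_ratio` of 0.15 are assumptions, not details taken from VSS-Net.

```python
from typing import List, Sequence, Tuple


def shot_scores(frame_scores: Sequence[float],
                shots: Sequence[Tuple[int, int]]) -> List[float]:
    """Average the frame-level importance inside each shot.

    Each shot is a non-empty [start, end) frame range (assumed format).
    """
    return [sum(frame_scores[s:e]) / (e - s) for s, e in shots]


def select_key_shots(frame_scores: Sequence[float],
                     shots: Sequence[Tuple[int, int]],
                     budget_ratio: float = 0.15) -> List[int]:
    """Pick shot indices with a 0/1 knapsack (assumed selection rule).

    Maximizes the sum of shot scores while keeping the total selected
    length within budget_ratio * number_of_frames.
    """
    lengths = [e - s for s, e in shots]
    values = shot_scores(frame_scores, shots)
    capacity = int(budget_ratio * len(frame_scores))

    # best[i][c]: best total score using the first i shots within capacity c.
    best = [[0.0] * (capacity + 1) for _ in range(len(shots) + 1)]
    for i, (length, value) in enumerate(zip(lengths, values), start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if length <= c and best[i - 1][c - length] + value > best[i][c]:
                best[i][c] = best[i - 1][c - length] + value

    # Backtrack to recover which shots were taken.
    chosen = []
    c = capacity
    for i in range(len(shots), 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)
```

In this sketch, the prediction head's output would be passed as `frame_scores`, and shot boundaries would come from a separate segmentation step that is outside the scope of the example.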

VSS-Net: Visual Semantic Self-Mining Network for Video Summarization