Abstract: Video summarization is an important technique for browsing, managing, and retrieving large numbers of videos efficiently. Its main objective is to minimize the information lost when selecting a subset of frames from the original video, so that the summary faithfully represents the overall story of the original. Recently developed unsupervised video summarization approaches do not require tedious annotation of important frames to train a summarization model and are therefore attractive in practice. However, their performance is still limited by the difficulty of minimizing the information loss between the summary and the original video. In this paper, we address unsupervised video summarization by developing a novel Cycle-consistent Adversarial LSTM architecture that effectively reduces the information loss in the summary video. The proposed model, named Cycle-SUM, consists of a frame selector and a cycle-consistent learning based evaluator. The selector is a bi-directional LSTM network that captures long-range relationships between video frames. To overcome the difficulty of specifying a suitable information-preserving metric between the original video and the summary video, the evaluator is introduced to "supervise" the selector and improve summarization quality. Specifically, the evaluator is composed of two generative adversarial networks (GANs): the forward GAN learns to reconstruct the original video from the summary video, while the backward GAN learns to invert the process. We establish the relation between mutual information maximization and this cycle learning procedure, and further introduce a cycle-consistent loss to regularize the summarization. Extensive experiments on three video summarization benchmark datasets demonstrate state-of-the-art performance and show the superiority of the Cycle-SUM model over other unsupervised approaches.
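The cycle-consistent loss mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the per-frame feature layout, the soft frame weighting in `weight_frames`, the choice of a mean absolute (L1) distance, and the toy generators. Only the direction of the two mappings follows the abstract: the forward generator maps a summary back to the original video, and the backward generator maps the original video to a summary.

```python
from typing import Callable, List

Video = List[List[float]]          # one feature vector per frame (hypothetical layout)
Generator = Callable[[Video], Video]


def weight_frames(frames: Video, scores: List[float]) -> Video:
    """Soft selection (assumed): scale each frame's features by its importance score."""
    return [[s * v for v in frame] for frame, s in zip(frames, scores)]


def l1_distance(a: Video, b: Video) -> float:
    """Mean absolute difference over all frame features (assumed distance)."""
    total = sum(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))
    count = sum(len(fa) for fa in a)
    return total / count


def cycle_consistency_loss(
    original: Video,
    summary: Video,
    forward_gen: Generator,   # summary -> reconstructed original
    backward_gen: Generator,  # original -> reconstructed summary
) -> float:
    """Penalize round trips that fail to return to their starting point."""
    # original -> summary -> original should recover the original video.
    forward_cycle = l1_distance(forward_gen(backward_gen(original)), original)
    # summary -> original -> summary should recover the summary video.
    backward_cycle = l1_distance(backward_gen(forward_gen(summary)), summary)
    return forward_cycle + backward_cycle


# Toy stand-ins for the two generators (purely illustrative, not learned networks).
def double(video: Video) -> Video:
    return [[2.0 * v for v in frame] for frame in video]


def halve(video: Video) -> Video:
    return [[0.5 * v for v in frame] for frame in video]


def identity(video: Video) -> Video:
    return [list(frame) for frame in video]


original = [[1.0, 2.0], [3.0, 4.0]]
summary = weight_frames(original, scores=[1.0, 0.5])

loss_inverse = cycle_consistency_loss(original, summary, double, halve)
loss_mismatch = cycle_consistency_loss(original, summary, double, identity)
```

When the two toy generators are exact inverses of each other, both round trips return their input and the loss is zero. When they are not, the loss grows with the round-trip error, which is the quantity such a regularizer is meant to drive down during training.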

Spatiotemporal Two-Stream LSTM Network for Unsupervised Video Summarization

Learning User Interest with Improved Triplet Deep Ranking and Web-Image Priors for Topic-Related Video Summarization

Creating Memorable Video Summaries That Satisfy the User's Intention for Taking the Videos

An Unsupervised Video Summarization Method Based on Multimodal Representation

A GAN Based Video Summarization Method with Representation Loss

Deep Semantic and Attentive Network for Unsupervised Video Summarization

A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning

Video Summarization with Long Short-term Memory

Spatial Attention Model‐modulated Bi‐directional Long Short‐term Memory for Unsupervised Video Summarisation

User-Ranking Video Summarization with Multi-Stage Spatio-Temporal Representation

Learning Multiscale Hierarchical Attention for Video Summarization

Unsupervised Video Summarization With Cycle-Consistent Adversarial LSTM Networks

Video Summarization through Reinforcement Learning with a 3D Spatio-Temporal U-Net

Graph Attention Networks Adjusted Bi-LSTM for Video Summarization

CSTA: CNN-based Spatiotemporal Attention for Video Summarization

Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network

Semantic Representation and Attention Alignment for Graph Information Bottleneck in Video Summarization

Cycle-SUM: Cycle-Consistent Adversarial LSTM Networks for Unsupervised Video Summarization

Exploring global diverse attention via pairwise temporal relation for video summarization

Deep Attentive Video Summarization with Distribution Consistency Learning

Video Saliency Prediction using Spatiotemporal Residual Attentive Networks