Abstract: Video data are growing exponentially, and extracting summarized information from them is challenging. Reducing the load of video traffic is necessary for efficient video storage, transmission, and retrieval. The aim of video summarization (VS) is to extract the most important content from video repositories effectively. Recent approaches feed a small set of representative features to recurrent networks to achieve VS. However, generating the desired summaries remains difficult because the extracted features have limited representativeness and feature refinement is not considered. In this article, we introduce a vision transformer (ViT)-assisted deep pyramidal refinement network that extracts and refines multi-scale features and predicts an importance score for each frame. The proposed network comprises four main modules. First, a dense prediction transformer with a ViT backbone, applied for the first time in this domain, acquires the optimal representations from the input frames. Then, feature maps are obtained from various layers separately and processed individually to support multi-scale progressive feature fusion and refinement before the data are passed to the final prediction module. Next, a customized pyramidal refinement block refines the multi-level feature set before the importance scores are predicted. Finally, video summaries are produced by selecting keyframes based on the predictions. To evaluate the proposed network, extensive experiments are conducted on the TVSum and SumMe datasets, where it achieves F1-scores of 62.4% and 51.9%, respectively, outperforming state-of-the-art alternatives by 0.9% and 0.5%.
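The last two steps of the abstract, selecting keyframes from per-frame importance scores and scoring the result with F1, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `select_keyframes` and `f1_score`, the top-k selection rule, and the default `budget` of 15% of frames are all assumptions made for illustration, since the abstract does not state how keyframes are chosen or how the F1-score is computed on TVSum and SumMe.

```python
from typing import List, Sequence


def select_keyframes(scores: Sequence[float], budget: float = 0.15) -> List[int]:
    """Return sorted indices of the highest-scoring frames.

    Assumption: a simple top-k rule, where k is a fraction (`budget`)
    of the video length. The paper may use a different selection rule.
    """
    if not scores:
        return []
    k = max(1, int(len(scores) * budget))
    # Rank frame indices by predicted importance, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order for the final summary.
    return sorted(ranked[:k])


def f1_score(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Frame-level F1 between a predicted and a reference summary."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical importance scores for a four-frame clip.
    scores = [0.1, 0.9, 0.3, 0.8]
    summary = select_keyframes(scores, budget=0.5)
    print(summary)
    print(f1_score(summary, reference=[1, 2]))
```

A top-k rule keeps the sketch short; a shot-level selection under a length budget would slot into `select_keyframes` without changing the F1 computation.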

Transformer-based Video Summarization with Spatial-Temporal Representation

Creating Memorable Video Summaries That Satisfy the User's Intention for Taking the Videos

Creating Personalized Video Summaries Via Semantic Event Detection

An Unsupervised Video Summarization Method Based on Multimodal Representation

A GAN Based Video Summarization Method with Representation Loss

Video Summarization Generation Model Based on Transformer and Deep Reinforcement Learning

Video Summarization with U-Shaped Transformer

Spatiotemporal Modeling for Video Summarization Using Convolutional Recurrent Neural Network

From Thumbnails to Summaries - A Single Deep Neural Network to Rule Them All

Video Summarization Via Weighted Neighborhood Based Representation

Unsupervised Video Summarization Based on An Encoder-Decoder Architecture

Video Summarization with a Dual-Path Attentive Network

Video Summarization Via Temporal Collaborative Representation of Adjacent Frames

Multi-Level Spatiotemporal Network for Video Summarization

Category Driven Deep Recurrent Neural Network for Video Summarization

Deep Semantic and Attentive Network for Unsupervised Video Summarization

Deep Multi-Scale Pyramidal Features Network for Supervised Video Summarization

Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network

Spatiotemporal Two-Stream LSTM Network for Unsupervised Video Summarization

Video Summarization Using Deep Neural Networks: A Survey

Topic-aware Video Summarization Using Multimodal Transformer