Abstract: Video data are growing exponentially, which makes extracting summarized information challenging. Reducing the load of video traffic is necessary for efficient video storage, transmission, and retrieval. The aim of video summarization (VS) is to extract the most important content from video repositories effectively. Recent attempts have fed a small number of representative features to recurrent networks to achieve VS. However, generating the desired summaries can be difficult because the extracted features have limited representativeness and are not refined. In this article, we introduce a vision transformer (ViT)-assisted deep pyramidal refinement network that extracts and refines multi-scale features and predicts an importance score for each frame. The proposed network comprises four main modules. First, a dense prediction transformer with a ViT backbone is applied, for the first time in this domain, to acquire the optimal representations from the input frames. Second, feature maps are obtained from various layers separately and processed individually to support multi-scale progressive feature fusion and refinement before the data are passed to the final prediction module. Third, a customized pyramidal refinement block refines the multi-level feature set before the importance scores are predicted. Finally, video summaries are produced by selecting keyframes based on the predictions. To evaluate the proposed network, extensive experiments are conducted on the TVSum and SumMe datasets, where it achieves F1-scores of 62.4% and 51.9%, respectively, outperforming state-of-the-art alternatives by 0.9% and 0.5%.
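The last two steps of the pipeline, selecting keyframes from per-frame importance scores and scoring the result with an F1 measure, can be illustrated with a minimal sketch. The abstract does not specify the selection rule or the exact evaluation protocol, so the top-`k` rule, the function names `select_keyframes` and `f1_score`, and the toy scores below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Set


def select_keyframes(scores: List[float], k: int) -> Set[int]:
    """Return the indices of the k highest-scoring frames.

    Assumption: a plain top-k rule; the paper may use a different
    selection strategy.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def f1_score(predicted: Set[int], reference: Set[int]) -> float:
    """Harmonic mean of precision and recall over selected frame indices."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy per-frame importance scores (hypothetical values, one per frame).
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.4, 0.6, 0.15]
summary = select_keyframes(scores, k=3)

# Hypothetical human-annotated keyframes for the same clip.
reference = {1, 3, 8}
print(sorted(summary), f1_score(summary, reference))
```

With these toy values, two of the three selected frames match the reference, so precision and recall are both 2/3. The reported F1-scores on TVSum and SumMe come from the paper's own evaluation, which this sketch does not reproduce.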
