Abstract: Video summarization analyzes the structure and content of a video and extracts key segments to build a summary that accurately conveys the main content, allowing users to grasp the core information without watching the full video. However, existing methods have difficulty capturing long-term dependencies in long videos. In addition, graph structures contain a large amount of noise, which introduces redundant information and hinders effective learning of video features. To address these problems, we propose a video summarization network based on dynamic graph contrastive learning and feature fusion, which consists of three modules: feature extraction, a video encoder, and feature fusion. First, we compute shot features and construct a dynamic graph whose nodes are the shot features and whose edge weights are the similarities between them. In the video encoder, stacked L-G Blocks, each consisting of a bidirectional long short-term memory network and a graph convolutional network, extract the temporal and structural features of the video and yield shallow-level features. To remove redundant information from the graph, graph contrastive learning is then used to obtain optimized deep-level features. Finally, to fully exploit the video's feature information, a feature fusion gate based on a gating mechanism fuses the shallow-level features with the deep-level features. Extensive experiments on two benchmark datasets, TVSum and SumMe, show that the proposed method outperforms most current state-of-the-art video summarization methods.
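
The graph construction described in the abstract (shot features as nodes, pairwise similarity as edge weights) can be illustrated with a minimal NumPy sketch. The abstract does not state which similarity measure or graph-convolution variant the authors use, so cosine similarity, the clipping of negative weights, and the symmetric normalization below are assumptions, and the function names `build_similarity_graph` and `gcn_layer` are hypothetical. The bidirectional LSTM half of the L-G Block is omitted.

```python
import numpy as np

def build_similarity_graph(shot_features):
    """Weighted adjacency from pairwise cosine similarity (assumed metric)."""
    norms = np.linalg.norm(shot_features, axis=1, keepdims=True)
    unit = shot_features / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T
    # Assumption: negative similarities are clipped so edge weights stay non-negative.
    return np.clip(similarity, 0.0, None)

def gcn_layer(adjacency, node_features, weight):
    """One graph-convolution step: ReLU(D^-1/2 A D^-1/2 X W)."""
    degree = adjacency.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.clip(degree, 1e-12, None))
    normalized = adjacency * inv_sqrt[:, None] * inv_sqrt[None, :]
    return np.maximum(normalized @ node_features @ weight, 0.0)

# Toy input: 6 shots with 8-dimensional features (random stand-ins for real shot features).
rng = np.random.default_rng(0)
shots = rng.normal(size=(6, 8))
adjacency = build_similarity_graph(shots)
hidden = gcn_layer(adjacency, shots, rng.normal(size=(8, 4)))
```

Because every shot has similarity 1 with itself, the diagonal of `adjacency` acts as a self-loop, so no separate identity term is added before normalization. The graph is "dynamic" in the sense that it is rebuilt from the features of each video rather than fixed in advance.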

What problem does this paper attempt to address?

This paper attempts to address two main issues in video summarization:

1. **Difficulty in capturing long-term dependencies**: Existing methods struggle to capture key information in long videos. Important information may be distributed throughout the entire video, which makes it difficult for traditional methods to extract it effectively.
2. **Noise in graph structures**: Graph structures contain a significant amount of noise, so redundant information interferes with the learning of video features and degrades the quality of the video summary.

To solve these problems, the authors propose a video summarization network based on dynamic graph contrastive learning and feature fusion. The network consists of three modules: feature extraction, video encoder, and feature fusion. With these modules, the method aims to capture long-term dependencies in videos and reduce noise in graph structures, thereby generating more accurate video summaries.
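
The feature fusion module combines shallow-level and deep-level features through a gate. The text does not give the gate's formula, so the sketch below assumes one common form: a sigmoid gate computed from the concatenated features, used as an element-wise mixing weight. The function name `gated_fusion` and the parameters `weight` and `bias` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(shallow, deep, weight, bias):
    """Assumed gate: g = sigmoid([shallow; deep] W + b), output = g*shallow + (1-g)*deep."""
    gate = sigmoid(np.concatenate([shallow, deep], axis=1) @ weight + bias)
    # Each output element is a convex combination of the two inputs.
    return gate * shallow + (1.0 - gate) * deep

# Toy input: 5 shots with 4-dimensional shallow and deep features (random stand-ins).
rng = np.random.default_rng(0)
shallow = rng.normal(size=(5, 4))
deep = rng.normal(size=(5, 4))
fused = gated_fusion(shallow, deep, rng.normal(size=(8, 4)), np.zeros(4))
```

With this form, a gate value near 1 keeps the shallow feature and a value near 0 keeps the deep feature, so the fused output always lies between the two inputs element-wise.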

Video Summarization Generation Network Based on Dynamic Graph Contrastive Learning and Feature Fusion

Creating Memorable Video Summaries That Satisfy the User's Intention for Taking the Videos

A Novel Compact Yet Rich Key Frame Creation Method for Compressed Video Summarization

Learning User Interest with Improved Triplet Deep Ranking and Web-Image Priors for Topic-Related Video Summarization

A GAN Based Video Summarization Method with Representation Loss

Creating Personalized Video Summaries Via Semantic Event Detection

Memorable and Rich Video Summarization

Dynamic graph convolutional network for multi-video summarization

An Unsupervised Video Summarization Method Based on Multimodal Representation

Reconstructive Sequence-Graph Network for Video Summarization

Feature fusion over hyperbolic graph convolution networks for video summarisation

A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning

Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network

Video summarization via knowledge-aware multimodal deep networks

Category Driven Deep Recurrent Neural Network for Video Summarization

Relational Reasoning over Spatial-Temporal Graphs for Video Summarization

Deep Semantic and Attentive Network for Unsupervised Video Summarization

Multi-View Video Summarization

Graph Attention Networks Adjusted Bi-LSTM for Video Summarization

Video Summarization Using Knowledge Distillation-Based Attentive Network

Video Summarization Generation Model Based on Transformer and Deep Reinforcement Learning