Abstract: Video processing and analysis have become urgent tasks, as a huge number of videos are uploaded to online platforms (e.g., YouTube, Hulu) every day. Extracting representative key frames from videos is important in video processing and analysis because it greatly reduces the computing resources and time required. Although great progress has been made recently, large-scale video classification remains an open problem, as existing methods do not balance performance and efficiency well. To tackle this problem, this work presents an unsupervised method for retrieving key frames that combines a convolutional neural network with temporal segment density peaks clustering. The proposed temporal segment density peaks clustering is a generic and powerful framework with two advantages over previous works: it determines the number of key frames automatically, and it preserves the temporal information of the video. It thus improves the efficiency of video classification. Furthermore, a long short-term memory network is added on top of the convolutional neural network to further improve classification performance. Moreover, a weight fusion strategy for the different input networks is presented to boost performance. By optimizing video classification and key frame extraction jointly, we achieve better classification performance and higher efficiency. We evaluate our method on two popular datasets (i.e., HMDB51 and UCF101), and the experimental results consistently demonstrate that our strategy achieves competitive performance and efficiency compared with state-of-the-art approaches.
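To make the key frame selection idea concrete, the following is a minimal sketch of density peaks clustering applied per temporal segment. It is an illustration under stated assumptions, not the method as implemented in the paper: the function names (`density_peaks`, `select_key_frames`), the percentile-based cutoff distance, the Gaussian density kernel, the equal-length segmentation, and the mean-plus-`z`-standard-deviations threshold are all hypothetical choices. The input `features` stands in for per-frame CNN features, which are assumed to be computed elsewhere.

```python
import numpy as np


def density_peaks(features, dc_percentile=20.0):
    """Return (rho, delta) per frame, following the density peaks idea.

    rho   : local density of each frame in feature space
    delta : distance to the nearest frame with higher density
    """
    n = len(features)
    # Pairwise Euclidean distances between frame features.
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    # Cutoff distance: a percentile of off-diagonal distances (assumed heuristic).
    off_diag = dist[~np.eye(n, dtype=bool)]
    dc = max(float(np.percentile(off_diag, dc_percentile)), 1e-12)
    # Gaussian-kernel density; subtract 1 to drop each frame's self term.
    rho = np.exp(-((dist / dc) ** 2)).sum(axis=1) - 1.0
    # Visit frames from densest to sparsest.
    order = np.argsort(-rho, kind="stable")
    delta = np.empty(n)
    # The densest frame gets the largest distance by convention.
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()
    return rho, delta


def select_key_frames(features, num_segments=3, z=1.0):
    """Pick key frame indices segment by segment, preserving temporal order."""
    key_frames = []
    # Contiguous, roughly equal-length temporal segments (assumed scheme).
    for seg in np.array_split(np.arange(len(features)), num_segments):
        if len(seg) == 0:
            continue
        if len(seg) < 3:
            # Too few frames to cluster; keep the first frame of the segment.
            key_frames.append(int(seg[0]))
            continue
        rho, delta = density_peaks(features[seg])
        gamma = rho * delta
        # The number of key frames is not fixed in advance: keep frames whose
        # gamma stands out, and fall back to the single best frame otherwise.
        picked = np.flatnonzero(gamma > gamma.mean() + z * gamma.std())
        if picked.size == 0:
            picked = np.array([gamma.argmax()])
        key_frames.extend(int(seg[i]) for i in picked)
    return sorted(key_frames)


# Usage with random stand-in features: 30 frames, 8-dimensional each.
rng = np.random.default_rng(0)
features = rng.normal(size=(30, 8))
keys = select_key_frames(features, num_segments=3)
```

In this sketch the number of selected frames falls out of the threshold on `gamma` rather than being a parameter, and processing contiguous segments in order keeps the selected indices temporally sorted. These correspond to the two advantages named in the abstract, though the paper's actual selection rule may differ.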

A Differentiable Parallel Sampler for Efficient Video Classification

OCSampler: Compressing Videos to One Clip with Single-step Sampling

Watching a Small Portion Could Be As Good As Watching All: Towards Efficient Video Classification

MGSampler: an Explainable Sampling Strategy for Video Action Recognition

Swift Sampler: Efficient Learning of Sampler by 10 Parameters

Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition

SiamSampler: Video-Guided Sampling for Siamese Visual Tracking

IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis

Efficient Video Segmentation Models with Per-frame Inference

A Dynamic Frame Selection Framework for Fast Video Recognition

A Closer Look at Video Sampling for Sequential Action Recognition

Deep Unsupervised Key Frame Extraction for Efficient Video Classification

Learning to Upsample by Learning to Sample

FASTER Recurrent Networks for Efficient Video Classification

Beyond the Prototype: Divide-and-conquer Proxies for Few-shot Segmentation

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

AdaFrame: Adaptive Frame Selection for Fast Video Recognition

Efficient Semantic Video Segmentation with Per-Frame Inference

Scalable Video Object Segmentation with Simplified Framework

Sampling-Priors-Augmented Deep Unfolding Network for Robust Video Compressive Sensing

Adaptive Focus for Efficient Video Recognition