Abstract: Recent developments in Transformers have brought notable advances in video comprehension. Nonetheless, the $O(N^2)$ computational complexity of attention mechanisms poses a substantial hurdle given the high dimensionality of videos. The challenge becomes particularly pronounced when increasing the frames per second (FPS) to better capture motion, since a higher frame rate is likely to introduce redundancy and exacerbate the existing computational limitations. In this paper, we first show that increasing the FPS rate improves performance. We then present a novel approach, Motion Guided Token Compression (MGTC), which enables Transformer models to use a smaller yet more representative set of tokens for comprehensive video representation. This yields substantial reductions in computational burden and adapts seamlessly to increased FPS rates. Specifically, we draw inspiration from video compression algorithms and examine the difference between patches in consecutive video frames along the temporal dimension. Tokens whose difference falls below a predetermined threshold are then masked. Notably, this masking strategy addresses video redundancy while preserving essential information. Our experiments on the widely studied video recognition datasets Kinetics-400, UCF101 and HMDB51 demonstrate that raising the FPS rate yields a significant top-1 accuracy improvement of over 1.6, 1.6 and 4.0, respectively. By applying MGTC with a masking ratio of 25\%, we further improve accuracy by 0.1 while reducing computational costs by over 31\% on Kinetics-400. Even within a fixed computational budget, higher FPS rates paired with MGTC sustain performance gains compared to lower FPS settings.
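
The masking rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a mean absolute difference as the patch disparity measure, keeps every token of the first frame because it has no predecessor, and uses hypothetical names (`patch_difference`, `motion_guided_mask`) and a toy threshold chosen only for illustration.

```python
from typing import List

Patch = List[float]   # flattened pixel values of one patch
Frame = List[Patch]   # patches of one frame, in a fixed spatial order
Video = List[Frame]   # frames in temporal order


def patch_difference(a: Patch, b: Patch) -> float:
    """Mean absolute difference between two co-located patches.

    Assumption: the abstract does not specify the disparity measure.
    """
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def motion_guided_mask(video: Video, threshold: float) -> List[List[bool]]:
    """Return keep[t][i]: True if the token for patch i of frame t is kept.

    Tokens whose difference to the same patch in the previous frame falls
    below `threshold` are masked (False). Assumption: all tokens of the
    first frame are kept.
    """
    keep = [[True] * len(video[0])]
    for prev, curr in zip(video, video[1:]):
        keep.append([patch_difference(p, c) >= threshold
                     for p, c in zip(prev, curr)])
    return keep


# Toy clip: 3 frames, 2 patches per frame, 4 pixels per patch.
# Patch 0 is static; patch 1 changes strongly once, then only slightly.
video = [
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]],
    [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 2.0]],
]
keep = motion_guided_mask(video, threshold=0.5)
kept = sum(flag for row in keep for flag in row)
total = sum(len(row) for row in keep)
print(f"kept {kept} of {total} tokens")
```

On this toy clip the sketch keeps 3 of 6 tokens: both tokens of the first frame and the one token that changed strongly in the second frame. Because attention cost grows quadratically with the number of tokens, dropping low-motion tokens before the Transformer is where the savings would come from.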

F2D-SIFPNet: a Frequency 2D Slow-I-Fast-P Network for Faster Compressed Video Action Recognition

Frequency Enhancement Network for Efficient Compressed Video Action Recognition

Action Recognition with Stacked Fisher Vectors

Motion Guided Token Compression for Efficient Masked Video Modeling

Dynamic Spatial Focus for Efficient Compressed Video Action Recognition

DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition

Joint Feature Optimization and Fusion for Compressed Action Recognition

Learning Discriminative Features for Fast Frame-Based Action Recognition

Multi-Stream Single Network: Efficient Compressed Video Action Recognition With a Single Multi-Input Multi-Output Network

Diffused Fourier Network for Video Action Segmentation

TEINet: Towards an Efficient Architecture for Video Recognition

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

SlowFast Networks for Video Recognition

GCF-Net: Gated Clip Fusion Network for Video Action Recognition

F4D: Factorized 4D Convolutional Neural Network for Efficient Video-level Representation Learning

F2S-Net: learning frame-to-segment prediction for online action detection

Cross-Fiber Spatial-Temporal Co-enhanced Networks for Video Action Recognition

Adaptive Focus for Efficient Video Recognition

A fast human action recognition network based on spatio-temporal features

FASTER Recurrent Networks for Efficient Video Classification

Advancing Compressed Video Action Recognition through Progressive Knowledge Distillation