Abstract: Recent developments in Transformers have made notable strides in video comprehension. Nonetheless, the $O(N^2)$ computational complexity of attention mechanisms presents a substantial hurdle given the high dimensionality of videos. This challenge becomes particularly pronounced when increasing the frames per second (FPS) to better capture motion, which is likely to introduce redundancy and exacerbate the existing computational limitations. In this paper, we first show that increasing the FPS rate improves performance. We then present a novel approach, Motion Guided Token Compression (MGTC), that enables Transformer models to use a smaller yet more representative set of tokens for comprehensive video representation. This yields substantial reductions in computational burden and adapts seamlessly to increased FPS rates. Specifically, we draw inspiration from video compression algorithms and examine the variance between patches in consecutive video frames across the temporal dimension. Tokens whose disparity falls below a predetermined threshold are then masked. Notably, this masking strategy addresses video redundancy while preserving essential information. Our experiments on the widely used video recognition datasets Kinetics-400, UCF101 and HMDB51 demonstrate that raising the FPS rate yields a significant top-1 accuracy improvement of over 1.6, 1.6 and 4.0, respectively. By applying MGTC with a masking ratio of 25\%, we further improve accuracy by 0.1 while reducing computational costs by over 31\% on Kinetics-400. Even within a fixed computational budget, higher FPS rates paired with MGTC sustain performance gains over lower FPS settings.
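
The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `motion_guided_keep_mask`, the use of mean absolute difference as the disparity measure, the patch size of 16, and the choice to always keep the first frame are all assumptions made for illustration. Only the core idea, masking tokens whose frame-to-frame disparity falls below a predetermined threshold, comes from the abstract.

```python
import numpy as np


def motion_guided_keep_mask(video, threshold, patch=16):
    """Return a boolean mask of tokens to keep (hypothetical helper).

    video:     array of shape (T, H, W, C), one entry per frame.
    threshold: predetermined disparity threshold; tokens below it are masked.
    patch:     spatial patch size (assumed; 16 is a common ViT choice).

    Output shape is (T, H // patch, W // patch); True means "keep".
    """
    T, H, W, C = video.shape
    gh, gw = H // patch, W // patch

    # Crop to a whole number of patches and split each frame into a patch grid.
    v = video[:, : gh * patch, : gw * patch].astype(np.float32)
    patches = v.reshape(T, gh, patch, gw, patch, C)

    # Assumed disparity measure: mean absolute difference between the same
    # patch location in consecutive frames. Result has shape (T - 1, gh, gw).
    diff = np.abs(patches[1:] - patches[:-1]).mean(axis=(2, 4, 5))

    # Assumption: the first frame has no predecessor, so all its tokens stay.
    keep = np.ones((T, gh, gw), dtype=bool)
    # Tokens whose disparity is below the threshold are masked (not kept).
    keep[1:] = diff >= threshold
    return keep
```

A model would then feed only the tokens where the mask is `True` into the Transformer, so the sequence length, and with it the attention cost, shrinks as more static patches are masked.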

Efficient Video Transformers via Spatial-Temporal Token Merging for Action Recognition

Motion Guided Token Compression for Efficient Masked Video Modeling

TransVOS: Video Object Segmentation with Transformers

Efficient Video Transformers with Spatial-Temporal Token Selection

HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition

MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

SVT: Supertoken Video Transformer for Efficient Video Understanding

Efficient Action Recognition with Introducing R(2+1)D Convolution to Improved Transformer

Efficient Vision Transformer via Token Merger

Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling

Efficient Visual Transformer by Learnable Token Merging

Convolutional transformer network for fine-grained action recognition

LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Spatio-Temporal Collaborative Module for Efficient Action Recognition

Efficient Video Action Detection with Token Dropout and Context Refinement

Making Vision Transformers Efficient from A Token Sparsification View

EgoViT: Pyramid Video Transformer for Egocentric Action Recognition

Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers

STFormer: Spatial-Temporal-Aware Transformer for Video Instance Segmentation