Abstract: Recent developments in Transformers have brought notable progress in video comprehension. Nonetheless, the $O(N^2)$ computational complexity of attention mechanisms poses a substantial obstacle given the high dimensionality of videos. The problem becomes particularly pronounced when the frames per second (FPS) rate is increased to better capture motion, since doing so introduces redundancy and exacerbates the existing computational limitations. In this paper, we first show that increasing the FPS rate improves performance. We then present a novel approach, Motion Guided Token Compression (MGTC), which enables Transformer models to represent a video with a smaller yet more representative set of tokens. This substantially reduces the computational burden and adapts seamlessly to higher FPS rates. Specifically, we draw inspiration from video compression algorithms and examine the difference between patches in consecutive video frames along the temporal dimension. Tokens whose difference falls below a predetermined threshold are then masked. This masking strategy removes video redundancy while preserving essential information. Our experiments on the widely used video recognition datasets Kinetics-400, UCF101, and HMDB51 demonstrate that raising the FPS rate improves top-1 accuracy by more than 1.6, 1.6, and 4.0 points, respectively. Applying MGTC with a masking ratio of 25\% further improves accuracy by 0.1 and simultaneously reduces computational cost by over 31\% on Kinetics-400. Even within a fixed computational budget, higher FPS rates paired with MGTC retain their performance gains over lower FPS settings.
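
The masking rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`patch_motion`, `motion_guided_mask`), the use of mean absolute pixel difference as the "difference" measure, the grayscale nested-list frame format, and the choice to keep every token of the first frame are all assumptions made for illustration. How the threshold relates to the reported 25\% masking ratio is not stated in the abstract and is not modelled here.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame, indexed as frame[y][x]


def patch_motion(prev: Frame, curr: Frame, patch: int) -> List[List[float]]:
    """Mean absolute difference per non-overlapping patch between two frames.

    Assumption: mean absolute pixel difference stands in for the paper's
    patch "difference"; the abstract does not name the exact metric.
    """
    rows, cols = len(curr) // patch, len(curr[0]) // patch
    scores = []
    for i in range(rows):
        row = []
        for j in range(cols):
            total = 0.0
            for y in range(i * patch, (i + 1) * patch):
                for x in range(j * patch, (j + 1) * patch):
                    total += abs(curr[y][x] - prev[y][x])
            row.append(total / (patch * patch))
        scores.append(row)
    return scores


def motion_guided_mask(
    video: List[Frame], patch: int, threshold: float
) -> List[List[List[bool]]]:
    """Return keep[t][i][j]: True if patch (i, j) of frame t is kept.

    Patches whose difference to the previous frame is below `threshold`
    are masked (False). Assumption: the first frame has no predecessor,
    so all of its patches are kept.
    """
    rows, cols = len(video[0]) // patch, len(video[0][0]) // patch
    keep = [[[True] * cols for _ in range(rows)]]
    for t in range(1, len(video)):
        scores = patch_motion(video[t - 1], video[t], patch)
        keep.append([[s >= threshold for s in row] for row in scores])
    return keep


# Toy clip: three 4x4 frames; only the top-left 2x2 patch changes, once.
frame0 = [[0.0] * 4 for _ in range(4)]
frame1 = [row[:] for row in frame0]
for y in range(2):
    for x in range(2):
        frame1[y][x] = 1.0
frame2 = [row[:] for row in frame1]

keep = motion_guided_mask([frame0, frame1, frame2], patch=2, threshold=0.5)
kept_tokens = sum(flag for frame in keep for row in frame for flag in row)
print(kept_tokens, "of", 3 * 2 * 2, "tokens kept")
```

In this toy clip only the patch that actually moves survives in the second frame, and the static third frame contributes no tokens at all, which is the kind of temporal redundancy the abstract says MGTC targets. Because attention cost grows quadratically with the number of tokens, dropping tokens before the Transformer is where the computational savings would come from.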

ElasticTok: Adaptive Tokenization for Image and Video

Motion Guided Token Compression for Efficient Masked Video Modeling

Adaptive Length Image Tokenization via Recurrent Allocation

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

Principles of Visual Tokens for Efficient Video Understanding

An Image is Worth 32 Tokens for Reconstruction and Generation

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

Efficient Video Transformers with Spatial-Temporal Token Selection

Token Shift Transformer for Video Classification

DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

Language-Guided Image Tokenization for Generation

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression

Make A Long Image Short: Adaptive Token Length for Vision Transformers

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Dynamic and Compressive Adaptation of Transformers From Images to Videos

TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

Efficient Video Action Detection with Token Dropout and Context Refinement

VidToMe: Video Token Merging for Zero-Shot Video Editing