Abstract: Recent developments in Transformers have brought notable progress in video comprehension. Nonetheless, the $O(N^2)$ computational complexity of attention mechanisms presents a substantial hurdle when dealing with the high dimensionality of videos. This challenge becomes particularly pronounced when increasing the frames per second (FPS) to better capture motion, since a higher frame rate tends to introduce redundancy and exacerbate the existing computational limitations. In this paper, we first show that increasing the FPS rate improves performance. We then present a novel approach, Motion Guided Token Compression (MGTC), which enables Transformer models to use a smaller yet more representative set of tokens for comprehensive video representation. This yields substantial reductions in computational burden and remains adaptable to increased FPS rates. Specifically, we draw inspiration from video compression algorithms and examine the variance between patches in consecutive video frames along the temporal dimension. Tokens whose disparity falls below a predetermined threshold are then masked. This masking strategy addresses video redundancy while preserving essential information. Our experiments on the widely used video recognition datasets Kinetics-400, UCF101 and HMDB51 demonstrate that raising the FPS rate improves top-1 accuracy by over 1.6, 1.6 and 4.0 points, respectively. By applying MGTC with a masking ratio of 25\%, we further improve accuracy by 0.1 and simultaneously reduce computational costs by over 31\% on Kinetics-400. Even within a fixed computational budget, higher FPS rates paired with MGTC sustain performance gains compared to lower FPS settings.
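The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `motion_guided_mask`, the use of mean absolute difference as the disparity measure, the choice to keep every token of the first frame, and the default `patch` and `threshold` values are all assumptions made for illustration only.

```python
import numpy as np

def motion_guided_mask(video, patch=16, threshold=0.05):
    """Return a boolean keep-mask of shape (T, H // patch, W // patch).

    video: float array of shape (T, H, W, C), values assumed in [0, 1].
    A token is kept when its patch differs from the co-located patch in
    the previous frame by at least `threshold` (mean absolute difference,
    an assumed disparity measure). All first-frame tokens are kept
    (also an assumption), since they have no predecessor to compare to.
    """
    T, H, W, C = video.shape
    gh, gw = H // patch, W // patch
    # Crop so that height and width are exact multiples of the patch size.
    v = video[:, :gh * patch, :gw * patch]
    # Split each frame into a (gh, gw) grid of patch x patch blocks.
    p = v.reshape(T, gh, patch, gw, patch, C)
    # Per-patch disparity between consecutive frames: shape (T - 1, gh, gw).
    diff = np.abs(p[1:] - p[:-1]).mean(axis=(2, 4, 5))
    keep = np.ones((T, gh, gw), dtype=bool)
    keep[1:] = diff >= threshold
    return keep

# Toy clip: 3 frames of 32x32 RGB; only the top-left patch changes in frame 1.
clip = np.zeros((3, 32, 32, 3), dtype=np.float32)
clip[1:, :16, :16] = 1.0
mask = motion_guided_mask(clip)
```

In this toy clip, frame 1 keeps only the patch that changed and frame 2, which is identical to frame 1, keeps none. A fixed masking ratio such as the 25\% mentioned above could presumably be obtained by setting the threshold to the matching quantile of `diff` instead of a constant, though how the paper selects its threshold is not stated here.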

Memory-Augmented Transformer for Efficient End-to-End Video Grounding

Fast Real-Time Video Object Segmentation with a Tangled Memory Network

Motion Guided Token Compression for Efficient Masked Video Modeling

TransVOS: Video Object Segmentation with Transformers

Memory-enhanced Hierarchical Transformer for Video Paragraph Captioning

Temporal Feature Aggregation for Efficient 2D Video Grounding

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

End-to-End Dense Video Grounding via Parallel Regression

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Hierarchical Local-Global Transformer for Temporal Sentence Grounding

Memory Consolidation Enables Long-Context Video Understanding

End-to-end Multi-modal Video Temporal Grounding

Efficient Video Grounding with Which-Where Reading Comprehension

TransVG: End-to-End Visual Grounding with Transformers

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Rethinking Video Sentence Grounding from a Tracking Perspective with Memory Network and Masked Attention

TALLFormer: Temporal Action Localization with a Long-memory Transformer

Transformer Based Memory Network for Video Anomaly Detection

GTLR: Graph-Based Transformer with Language Reconstruction for Video Paragraph Grounding

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning