Abstract: Recent developments in Transformers have achieved notable strides in video comprehension. Nonetheless, the $O(N^2)$ computational complexity of attention in the number of tokens $N$ presents a substantial hurdle given the high dimensionality of videos. The challenge is particularly pronounced when increasing the frames per second (FPS) to better capture motion, since doing so is likely to introduce redundancy and exacerbate the existing computational limitations. In this paper, we first show that increasing the FPS rate improves performance. We then present a novel approach, Motion Guided Token Compression (MGTC), which enables Transformer models to use a smaller yet more representative set of tokens for comprehensive video representation. This yields a substantial reduction in computational burden and remains seamlessly adaptable to higher FPS rates. Specifically, we draw inspiration from video compression algorithms and examine the variance between patches in consecutive video frames across the temporal dimension. Tokens whose disparity falls below a predetermined threshold are then masked. Notably, this masking strategy addresses video redundancy while preserving essential information. Our experiments on three widely used video recognition datasets, Kinetics-400, UCF101 and HMDB51, demonstrate that raising the FPS rate improves top-1 accuracy by over 1.6, 1.6 and 4.0, respectively. By applying MGTC with a masking ratio of 25\%, we further improve accuracy by 0.1 and simultaneously reduce computational costs by over 31\% on Kinetics-400. Even within a fixed computational budget, higher FPS rates paired with MGTC sustain performance gains over lower FPS settings.
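The masking rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `patch_motion_scores` and `motion_guided_mask`, the use of mean absolute difference as the disparity measure, the grayscale nested-list frame format, and the choice to keep every token of the first frame are all assumptions made for illustration. Plain Python is used so the example runs without dependencies.

```python
def patch_motion_scores(prev_frame, next_frame, patch):
    """Mean absolute pixel difference per patch between two grayscale frames.

    Assumption: frames are lists of rows of numbers; any rows or columns
    that do not fill a whole patch are ignored.
    """
    height = len(prev_frame) - len(prev_frame) % patch
    width = len(prev_frame[0]) - len(prev_frame[0]) % patch
    scores = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            total = 0.0
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    total += abs(next_frame[y][x] - prev_frame[y][x])
            row.append(total / (patch * patch))
        scores.append(row)
    return scores


def motion_guided_mask(frames, patch, threshold):
    """Return keep[t][i][j], True where the patch token is kept.

    Tokens whose disparity to the previous frame is below `threshold`
    are masked. Keeping the whole first frame is an assumption: it has
    no predecessor to compare against.
    """
    rows = len(frames[0]) // patch
    cols = len(frames[0][0]) // patch
    keep = [[[True] * cols for _ in range(rows)]]
    for prev_frame, next_frame in zip(frames, frames[1:]):
        scores = patch_motion_scores(prev_frame, next_frame, patch)
        keep.append([[score >= threshold for score in row] for row in scores])
    return keep


# Toy clip: three 4x4 frames split into 2x2 patches.
# Only the top-left patch changes, and only between frames 0 and 1.
frame0 = [[0] * 4 for _ in range(4)]
frame1 = [[10, 10, 0, 0], [10, 10, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
frame2 = [row[:] for row in frame1]

keep = motion_guided_mask([frame0, frame1, frame2], patch=2, threshold=1.0)
kept_tokens = sum(flag for frame in keep for row in frame for flag in row)
total_tokens = sum(len(row) for frame in keep for row in frame)
print(kept_tokens, "of", total_tokens, "tokens kept")
```

In this toy clip the static patches are dropped and only the patch that moved survives after the first frame. As a rough guide to the savings, keeping a fraction $k$ of the tokens scales the quadratic attention term by $k^2$ (about 0.56 for $k = 0.75$) and the per-token linear terms by $k$; the abstract does not break down how its reported 31\% reduction splits across these terms.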

Predicting Diverse Future Frames with Local Transformation-Guided Masking

Motion Guided Token Compression for Efficient Masked Video Modeling

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Video Frame Prediction by Deep Multi-Branch Mask Network

Adaptive Recurrent Frame Prediction with Learnable Motion Vectors

Video Frame Prediction from a Single Image and Events

Transframer: Arbitrary Frame Prediction with Generative Models

MaskViT: Masked Visual Pre-Training for Video Prediction

Optimizing Video Prediction via Video Frame Interpolation

Predicting Long-horizon Futures by Conditioning on Geometry and Time

Pair-wise Layer Attention with Spatial Masking for Video Prediction

Looking-Ahead: Neural Future Video Frame Prediction

Future Frame Prediction for Robot-assisted Surgery

Flexible Spatio-Temporal Networks for Video Prediction

Future Frame Prediction for Anomaly Detection -- A New Baseline

Video prediction: a step-by-step improvement of a video synthesis network

GSSTU: Generative Spatial Self-Attention Transformer Unit for Enhanced Video Prediction

From Single to Multiple: Leveraging Multi-level Prediction Spaces for Video Forecasting

Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

Adaptive Future Frame Prediction with Ensemble Network

Predicting Future Instance Segmentation by Forecasting Convolutional Features