Abstract: Recent developments in Transformers have markedly improved video comprehension. Nonetheless, the $O(N^2)$ computational complexity of attention mechanisms poses a substantial obstacle given the high dimensionality of videos. The problem becomes particularly pronounced when the frames per second (FPS) are increased to better capture motion, since a higher frame rate is likely to introduce redundancy and aggravate the existing computational limits. In this paper, we first show that increasing the FPS rate improves performance. We then present a novel approach, Motion Guided Token Compression (MGTC), which enables Transformer models to represent a video with a smaller yet more representative set of tokens. This substantially reduces the computational burden and adapts readily to higher FPS rates. Specifically, we draw inspiration from video compression algorithms and examine the variance between patches in consecutive video frames along the temporal dimension. Tokens whose disparity falls below a predetermined threshold are then masked. This masking strategy removes redundancy in the video while preserving essential information. Our experiments on the widely used video recognition datasets Kinetics-400, UCF101 and HMDB51 demonstrate that raising the FPS rate improves top-1 accuracy by over 1.6, 1.6 and 4.0, respectively. Applying MGTC with a masking ratio of 25\% further improves accuracy by 0.1 while reducing computational cost by over 31\% on Kinetics-400. Even within a fixed computational budget, higher FPS rates paired with MGTC sustain performance gains over lower FPS settings.
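
The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the difference between patches is measured, so the mean absolute difference used here is an assumption, as are the function names, the toy data, and the choice to never mask the first frame (it has no preceding frame to compare against).

```python
def patch_difference(patch_a, patch_b):
    """Mean absolute difference between two flattened patches (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(patch_a, patch_b)) / len(patch_a)


def motion_scores(frames):
    """scores[t][p] compares patch p of frame t with patch p of frame t-1.

    frames[t][p] is a flat list of pixel values. Frame 0 has no predecessor,
    so its patches get an infinite score and are always kept (assumption).
    """
    scores = [[float("inf")] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        scores.append([patch_difference(a, b) for a, b in zip(prev, cur)])
    return scores


def motion_guided_mask(frames, threshold):
    """keep[t][p] is False when the token's score is below the threshold."""
    return [[s >= threshold for s in row] for row in motion_scores(frames)]


def relative_attention_cost(keep):
    """Cost of the quadratic attention term relative to using every token."""
    total = sum(len(row) for row in keep)
    kept = sum(flag for row in keep for flag in row)
    return (kept / total) ** 2


# Toy clip: 3 frames, 4 patches per frame, 2 pixels per patch (hypothetical data).
frames = [
    [[0, 0], [10, 10], [20, 20], [30, 30]],
    [[0, 0], [10, 12], [20, 20], [38, 30]],
    [[0, 0], [10, 12], [25, 25], [38, 30]],
]

keep = motion_guided_mask(frames, threshold=0.5)
kept = sum(flag for row in keep for flag in row)
print(f"kept {kept} of {sum(len(r) for r in keep)} tokens")
print(f"relative attention cost: {relative_attention_cost(keep):.3f}")
```

Under these assumptions, static patches are dropped and only patches that change between frames survive. Note that `relative_attention_cost` covers only the quadratic attention term: masking 25\% of the tokens scales that term by $0.75^2 \approx 0.56$, which is not the same quantity as the overall cost reduction reported in the abstract.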

A Compression and Recognition Joint Model for Structured Video Surveillance Storage

Foreground-Background Parallel Compression with Residual Encoding for Surveillance Video

Motion Guided Token Compression for Efficient Masked Video Modeling

Joint Feature and Texture Coding: Toward Smart Video Representation Via Front-End Intelligence

Robust moving object segmentation in the compressed domain for H.264/AVC video stream

Joint Compression of Near-Duplicate Videos

Collaborative Scalable Visual Compression for Human-Centered Videos

Intelligent Analysis Oriented Surveillance Video Coding

Applications of just-noticeable depth difference model in joint multiview video plus depth coding

Spatiotemporal Attention-based Semantic Compression for Real-time Video Recognition

A Joint Compression Scheme of Video Feature Descriptors and Visual Content

Memory-Efficient Network for Large-scale Video Compressive Sensing

Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation

Joint Feature Optimization and Fusion for Compressed Action Recognition

Hybrid CNN-Transformer Architecture for Efficient Large-Scale Video Snapshot Compressive Imaging

EfficientSCI: Densely Connected Network with Space-time Factorization for Large-scale Video Snapshot Compressive Imaging

Key Frames Assisted Hybrid Encoding for High-Quality Compressive Video Sensing

DMVC: Multi-Camera Video Compression Network aimed at Improving Deep Learning Accuracy

A Unified Framework for Jointly Compressing Visual and Semantic Data

A Compressive Prior Guided Mask Predictive Coding Approach for Video Analysis

Increasing Compression Ratio of Low Complexity Compressive Sensing Video Encoder with Application-Aware Configurable Mechanism