Abstract: Video Transformers (VidTs) have reached the forefront of accuracy in various video understanding tasks. Despite these achievements, processing a large number of video frames remains a significant performance bottleneck that impedes deployment on resource-constrained platforms. Accelerators designed for Vision Transformers (ViTs) have emerged, but they may not be the optimal solution for VidTs, for two reasons. First, these accelerators tend to overlook the inherent temporal redundancy that characterizes VidTs, which limits their room for further performance improvement. Second, incorporating a sparse attention prediction module within these accelerators incurs considerable overhead. We therefore turn to the video CODEC, which is essential for video preprocessing and can detect temporal and spatial similarity in raw video frames. This makes it possible to exploit the temporal and spatial redundancy in VidTs without paying a significant prediction cost. This paper proposes CMC, the first CODEC-assisted algorithm-accelerator co-design framework for VidT acceleration. On the algorithm side, we offer CODEC-friendly inter- and intra-matrix prediction algorithms that identify the informative data on the fly. We then design a recovery algorithm so that computation on non-informative data in the temporal and spatial domains can be safely skipped, with the skipped results recovered by copying the features of the informative data to preserve accuracy. On the hardware side, we propose augmenting the video CODEC so that it implements the inter- and intra-matrix prediction algorithms efficiently at negligible cost. Additionally, we propose a specialized CMC architecture that includes a recovery engine with fine-grained buffer management to translate the algorithm's computational savings into real speedup. Experiments show that CMC achieves 2.1×, 8.8×, and 42.4× speedup over the state-of-the-art ViT accelerator HeatViT, an A100 GPU, and a Xeon CPU, respectively, with negligible accuracy loss.
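
The skip-and-recover idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual prediction or recovery algorithms, so everything below is an assumption made for illustration: the names `condense` and `process_frame`, the sum-of-absolute-differences (SAD) test against the co-located token of the previous frame, the `threshold` parameter, and the per-token `toy_layer` stand-in are all hypothetical. Only the temporal (inter-matrix) case is shown; a real VidT layer mixes tokens through attention, so copying features there would be an approximation rather than an exact result.

```python
from typing import Callable, List, Sequence, Tuple

Token = Tuple[float, ...]


def condense(prev: Sequence[Token], curr: Sequence[Token],
             threshold: float) -> Tuple[List[int], List[int]]:
    """Split token indices of `curr` into (informative, redundant).

    Hypothetical criterion: a token is redundant when its sum of absolute
    differences (SAD) to the co-located token in `prev` is <= threshold.
    """
    informative: List[int] = []
    redundant: List[int] = []
    for i, (p, c) in enumerate(zip(prev, curr)):
        sad = sum(abs(a - b) for a, b in zip(p, c))
        if sad <= threshold:
            redundant.append(i)
        else:
            informative.append(i)
    return informative, redundant


def process_frame(prev_tokens: Sequence[Token], prev_out: Sequence[Token],
                  curr_tokens: Sequence[Token],
                  layer: Callable[[Token], Token],
                  threshold: float) -> Tuple[List[Token], int]:
    """Compute `layer` only on informative tokens and recover the rest.

    Returns the per-token outputs and the number of tokens actually computed.
    """
    informative, redundant = condense(prev_tokens, curr_tokens, threshold)
    out: List[Token] = [()] * len(curr_tokens)
    for i in informative:
        out[i] = layer(curr_tokens[i])   # real computation
    for i in redundant:
        out[i] = prev_out[i]             # recovery: copy the earlier result
    return out, len(informative)


def toy_layer(tok: Token) -> Token:
    """Hypothetical per-token stand-in for a transformer layer."""
    return tuple(2.0 * x for x in tok)


frame0 = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
frame1 = [(1.0, 2.0), (3.5, 4.0), (9.0, 6.0)]  # only the last token changes much

out0 = [toy_layer(t) for t in frame0]          # first frame is computed in full
out1, computed = process_frame(frame0, out0, frame1, toy_layer, threshold=0.6)
print(computed, out1)
```

In this toy run only one of the three tokens in the second frame is computed; the other two reuse the first frame's outputs. The second token's output is copied even though its input changed slightly, which is the kind of small approximation that a similarity threshold trades for skipped computation.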

CMC: Video Transformer Acceleration Via CODEC Assisted Matrix Condensing
