CMC: Video Transformer Acceleration Via CODEC Assisted Matrix Condensing

Zhuoran Song, Chunyu Qi, Fangxin Liu, Naifeng Jing, Xiaoyao Liang
DOI: https://doi.org/10.1145/3620665.3640393
2024-01-01
Abstract: Video Transformers (VidTs) have reached the forefront of accuracy in various video understanding tasks. Despite these achievements, the need to process a large number of video frames remains a significant performance bottleneck, impeding their deployment on resource-constrained platforms. Accelerators designed specifically for Vision Transformers (ViTs) have emerged, but they may not be the optimal solution for VidTs, for two main reasons. First, these accelerators tend to overlook the inherent temporal redundancy that characterizes VidTs, limiting their room for further performance gains. Second, incorporating a sparse attention prediction module within these accelerators incurs considerable overhead. To this end, we turn our attention to the video CODEC, which is essential for video preprocessing and can be used to detect temporal and spatial similarity in raw video frames, showing the potential to exploit the temporal and spatial redundancy in VidTs while avoiding significant prediction costs. This paper proposes CMC, the first CODEC-assisted algorithm-accelerator co-design framework for VidT acceleration. On the algorithm side, we offer CODEC-friendly inter- and intra-matrix prediction algorithms to identify the informative data on the fly. We then design a recovery algorithm so that we can safely skip computation on non-informative data in the temporal and spatial domains and recover the results by copying the informative data's features, preserving accuracy. On the hardware side, we propose to augment the video CODEC so that it efficiently implements the inter- and intra-matrix prediction algorithms at negligible cost. Additionally, we propose a specialized CMC architecture that includes a recovery engine with fine-grained buffer management to translate the algorithm's computational savings into real speedup. Experiments show that CMC achieves 2.1×, 8.8×, and 42.4× speedup over the state-of-the-art ViT accelerator HeatViT, an A100 GPU, and a Xeon CPU, respectively, with negligible accuracy loss.
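The skip-and-recover idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify how the CODEC-based prediction works, so the sketch assumes a simple stand-in in which a token counts as non-informative when it lies within an L1 `threshold` of the token at the same position in the previous frame. The names `condense_and_recover`, `l1_distance`, `layer`, and `threshold` are hypothetical, and only the temporal (inter-frame) case is covered.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """Sum of absolute element-wise differences between two tokens."""
    return sum(abs(x - y) for x, y in zip(a, b))


def condense_and_recover(
    frames: List[List[Vector]],
    layer: Callable[[Vector], List[float]],
    threshold: float,
) -> Tuple[List[List[List[float]]], int]:
    """Apply `layer` only to informative tokens; copy features for the rest.

    Assumption (not from the paper): a token is non-informative when it is
    within `threshold` (L1) of the same-position token in the previous frame.
    Returns the per-frame features and the number of tokens actually computed.
    """
    outputs: List[List[List[float]]] = []
    computed = 0
    for t, frame in enumerate(frames):
        frame_out: List[List[float]] = []
        for i, token in enumerate(frame):
            if t > 0 and l1_distance(token, frames[t - 1][i]) <= threshold:
                # Non-informative: skip the computation and recover the
                # result by copying the previous frame's feature.
                frame_out.append(outputs[t - 1][i])
            else:
                # Informative: run the (expensive) layer on this token.
                frame_out.append(layer(token))
                computed += 1
        outputs.append(frame_out)
    return outputs, computed


# Two frames of two tokens each; the first token is unchanged across frames.
frames = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [9.0, 9.0]]]
features, computed = condense_and_recover(
    frames, lambda v: [2 * x for x in v], threshold=0.1
)
print(computed)  # 3 of the 4 tokens were computed; one was copied
```

In this toy run the unchanged token in the second frame is never passed to `layer`; its feature is copied from the first frame, which is the kind of saving the recovery step is meant to provide.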