Self-supervised Compressed Video Action Recognition via Temporal-Consistent Sampling.

Pan Chen, Shaohui Lin, Yongxiang Zhang, Jiachen Xu, Xin Tan, Lizhuang Ma
DOI: https://doi.org/10.1007/978-3-030-92273-3_20
2021-01-01
Abstract: Compressed video action recognition aims to classify actions directly in compressed video rather than in decoded (standard) video. Because it avoids processing redundant information, it enables fast training and inference. However, existing methods still rely on costly labels for training. In this paper, we propose a self-supervised compressed video action recognition method based on Momentum Contrast (MoCo) and temporal-consistent sampling. We incorporate temporal-consistent sampling into MoCo to improve the feature representation of each input modality of compressed video. Modality-oriented fine-tuning is then introduced to apply the learned features to the downstream compressed video action recognition task. Extensive experiments demonstrate the effectiveness of our method on different datasets with different backbones. Compared with state-of-the-art self-supervised learning methods for decoded videos, our method achieves the highest accuracy on the HMDB51 dataset, 57.8%.
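For readers unfamiliar with MoCo, the following is a minimal sketch of its two standard components: the momentum update of the key encoder and the InfoNCE contrastive loss. It is not the authors' implementation. The abstract does not specify how temporal-consistent sampling selects clips or how the compressed-video modalities are encoded, so neither is modeled here. The function names, the plain-list feature vectors, and the default values (momentum 0.999, temperature 0.07) are illustrative assumptions, not details taken from this paper.

```python
import math


def momentum_update(key_params, query_params, m=0.999):
    """Standard MoCo key-encoder update: theta_k <- m * theta_k + (1 - m) * theta_q.

    Parameters are flat lists of floats here (an assumption for illustration).
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for a single query feature.

    The positive key would come from a second sample of the same video;
    negatives would come from MoCo's queue of keys from other videos.
    """
    q = l2_normalize(query)
    keys = [l2_normalize(positive_key)] + [l2_normalize(k) for k in negative_keys]
    logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Cross-entropy with the positive key at index 0.
    return log_sum_exp - logits[0]
```

In this sketch the loss is small when the query is closer to its positive key than to the negatives, which is the behavior a sampling strategy for positive pairs is meant to encourage.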