NJU MCG - Sensetime Team Submission to Pre-training for Video Understanding Challenge Track II.

Liwei Jin, Haoyue Cheng, Su Xu, Wayne Wu, Limin Wang
DOI: https://doi.org/10.1145/3474085.3479221
2021-01-01
Abstract: This paper presents the method that underlies our submission to the Pre-training for Video Understanding Challenge Track II. We follow the basic pipeline of temporal segment networks [20] and further improve its performance in several aspects. Specifically, we use the latest transformer-based architectures, e.g., Swin Transformer, DeiT, and CLIP-ViT, to enhance the representation power. We analyze different pre-training proxy tasks on the official pre-training datasets and other open-source video datasets. With these techniques, we derive an ensemble of deep models that attains a high classification accuracy (Top-1 accuracy 62.28%) on the testing set and secures first place in Track II of this challenge.
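As a rough illustration of the pipeline the abstract refers to, the sketch below shows sparse per-segment frame sampling in the style of temporal segment networks, average consensus over segment scores, and a simple ensemble across models. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the ensemble combines its members, so averaging softmax probabilities is an assumption, and all function names and the example numbers are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def sample_segment_indices(num_frames: int, num_segments: int,
                           training: bool = False,
                           seed: Optional[int] = None) -> List[int]:
    """Split a video into equal segments and pick one frame index per segment.

    Random frame within each segment when training, middle frame otherwise.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = random.Random(seed)
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = min(int(k * seg_len), num_frames - 1)
        # Guarantee a non-empty segment even when the video is very short.
        end = min(max(start + 1, int((k + 1) * seg_len)), num_frames)
        if training:
            indices.append(rng.randrange(start, end))
        else:
            indices.append((start + end - 1) // 2)
    return indices


def average_scores(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean over a list of equal-length score vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(
    model_segment_logits: Sequence[Sequence[Sequence[float]]],
) -> Tuple[int, List[float]]:
    """Combine models: model_segment_logits[m][s][c] are the class logits
    of model m on segment s.

    Segment logits are averaged per model (average consensus), turned into
    probabilities, then averaged across models (assumed ensemble rule).
    """
    per_model = [softmax(average_scores(segs)) for segs in model_segment_logits]
    probs = average_scores(per_model)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

For example, `sample_segment_indices(16, 4)` picks the middle frame of each of four 4-frame segments. In practice each model in the ensemble would be a video backbone producing the per-segment logits; here they are just lists of floats.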