Empirical Study of Unsupervised Pre-Training in CNN and Transformer Based Visual Tracking

Yannan Cai, Zhenzhong Wei
DOI: https://doi.org/10.1109/icaica58456.2023.10405496
2023-01-01
Abstract: Deep learning-based visual object tracking has seen the emergence of CNN-based and Transformer-based algorithms built upon the Siamese-based pipeline to pursue robustness and accuracy. However, the performance gap between them requires high-quality and large-scale labeled data for sufficient training. In this work, we design an unsupervised pre-training scheme based on data augmentation to reduce the dependence on expensive labeled data. The core step is the object localization pretext task, which randomly crops the object and pastes it onto several background images. Moreover, we apply the method to both CNN-based and Transformer-based visual trackers. Extensive experiments on public datasets demonstrate that our method outperforms prevailing unsupervised trackers on large-scale benchmarks such as LaSOT and TrackingNet. Additionally, a simple strategy of freezing the CNN backbone during Transformer-based pre-training proves to be effective.
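The crop-and-paste pretext task described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_pretext_samples`, the representation of images as plain nested lists, the fixed crop size, and the `(top, left, height, width)` box format are all assumptions made for illustration. The abstract does not specify the cropping policy, patch scale, or any further augmentation.

```python
import random


def make_pretext_samples(source, backgrounds, crop_size, rng=None):
    """Crop one random patch from `source` and paste it onto each background.

    Hypothetical helper, not from the paper. Images are 2D lists (rows of
    pixel values). Returns a list of (composite, box) pairs, where
    box = (top, left, height, width) is the free pseudo-label recording
    where the patch was pasted.
    """
    if rng is None:
        rng = random.Random()
    ch, cw = crop_size
    sh, sw = len(source), len(source[0])

    # Pick a random crop window that lies fully inside the source image.
    top = rng.randint(0, sh - ch)
    left = rng.randint(0, sw - cw)
    patch = [row[left:left + cw] for row in source[top:top + ch]]

    samples = []
    for bg in backgrounds:
        bh, bw = len(bg), len(bg[0])
        # Pick a random paste location that keeps the patch inside the background.
        pt = rng.randint(0, bh - ch)
        pl = rng.randint(0, bw - cw)
        composite = [row[:] for row in bg]  # copy so the background is untouched
        for r in range(ch):
            composite[pt + r][pl:pl + cw] = patch[r]
        samples.append((composite, (pt, pl, ch, cw)))
    return samples


# Toy data: an 8x8 "object" image of ones and three 16x16 backgrounds of zeros.
source = [[1] * 8 for _ in range(8)]
backgrounds = [[[0] * 16 for _ in range(16)] for _ in range(3)]
samples = make_pretext_samples(source, backgrounds, (4, 4), random.Random(0))
```

Each returned pair gives a tracker a search image together with a known target location, so a localization loss can be computed without manual annotation.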