Exploiting spatial and temporal context for online tracking with improved transformer
Jianwei Zhang, Jingchao Wang, Huanlong Zhang, Mengen Miao, Jie Zhang, Di Wu
DOI: https://doi.org/10.1016/j.imavis.2023.104672
IF: 3.86
2023-04-09
Image and Vision Computing
Abstract: The transformer is becoming increasingly popular in computer vision tasks because of its ability to capture long-range dependencies via self-attention. In this paper, we propose TrCAR, a transformer-based classification-regression network that uses the transformer to exploit deeper spatial and temporal context. Unlike the classic transformer architecture, we introduce the convolution operation into the transformer and change the calculation of features to make it suitable for the tracking task. The improved transformer encoder is then introduced into the regression branch of TrCAR and combined with the feature pyramid to complete multi-layer feature fusion, which is conducive to obtaining a high-quality target representation. To further enable the target model to adapt to changes in target appearance, we bring gradient descent to the regression branch so that it can be updated online to produce a more precise bounding box. Meanwhile, the new transformer is integrated into the classification branch of TrCAR, where it extracts the essential features of the target across historical frames as fully as possible via its global computing capability and uses them to emphasize the target position in the current frame via cross-attention, which helps the classifier identify the correct target more easily. Experimental results on the OTB, LaSOT, VOT2018, NFS, GOT-10k, and TrackingNet benchmarks show that TrCAR achieves performance comparable to that of popular trackers.
computer science, artificial intelligence, theory & methods, engineering, electrical & electronic, software engineering, optics
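
The abstract describes using cross-attention so that features gathered from historical frames emphasize the target position in the current frame. The following is a minimal sketch of plain scaled dot-product cross-attention in NumPy, not the TrCAR implementation. The function name `cross_attention`, the feature sizes, and the absence of learned projections and convolutions are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, memory):
    """Scaled dot-product cross-attention (hypothetical helper).

    query:  (Nq, d) features of the current frame.
    memory: (Nm, d) features collected from historical frames.
    Returns the attended features (Nq, d) and the attention weights (Nq, Nm).
    """
    d = query.shape[-1]
    scores = query @ memory.T / np.sqrt(d)   # similarity of each query to each memory item
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ memory, weights


# Illustrative sizes only: 16 current-frame locations, 3 past frames of 4 locations, 8 channels.
rng = np.random.default_rng(0)
current = rng.normal(size=(16, 8))
history = rng.normal(size=(3 * 4, 8))
attended, weights = cross_attention(current, history)
```

Each row of `weights` indicates how strongly a current-frame location attends to each historical feature. A real tracker would add learned query, key, and value projections, and the paper's variant additionally introduces convolution into the transformer.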