CSART: Channel and Spatial Attention-Guided Residual Learning for Real-Time Object Tracking.
Dawei Zhang,Zhonglong Zheng,Minglu Li,Rixian Liu
DOI: https://doi.org/10.1016/j.neucom.2020.11.046
IF: 6
2020-01-01
Neurocomputing
Abstract: Siamese networks have achieved great success in object tracking because they balance precision and speed. However, Siamese trackers usually use the local features of the last layer, which may degrade tracking performance in difficult scenarios. In this paper, we propose a novel Channel and Spatial Attention-guided Residual learning framework for Tracking, referred to as CSART, which improves the feature representation of Siamese networks by exploiting self-attention mechanisms to capture powerful contextual information. Specifically, for efficient and seamless integration, different kinds of self-attention are appended to the template and search branches of the Siamese network, respectively, to model global semantic inter-dependencies in the channel and spatial dimensions. To avoid representation degradation, we adaptively aggregate the basic features and their attention-weighted counterparts through residual learning. Furthermore, a joint loss consisting of the classic logistic loss and a focal softmax loss is designed to emphasize difficult samples and guide the learning of the whole model. Benefiting from this scheme, CSART alleviates over-fitting to some extent and enhances discriminability. Extensive experiments on six popular tracking datasets indicate that the proposed tracker outperforms other state-of-the-art trackers while running at 65 fps. (c) 2020 Elsevier B.V. All rights reserved.
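
The joint loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption: the two terms are combined as a weighted sum, the softmax runs over two classes (background, target), and the names `lam` and `gamma` and all function names are hypothetical, not taken from the paper.

```python
import math


def logistic_loss(score, label):
    """Logistic loss log(1 + exp(-y * v)) for one position; label is +1 or -1."""
    z = -label * score
    # Numerically stable form of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def focal_softmax_loss(logits, target, gamma=2.0):
    """Focal loss -(1 - p_t)^gamma * log(p_t) on a softmax over class logits."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(v - m) for v in logits))
    log_p_t = (logits[target] - m) - log_sum  # log-softmax of the true class
    p_t = math.exp(log_p_t)
    # Well-classified samples (p_t near 1) are down-weighted by (1 - p_t)^gamma.
    return -((1.0 - p_t) ** gamma) * log_p_t


def joint_loss(score, logits, label, lam=1.0, gamma=2.0):
    """Assumed combination: logistic term plus lam times the focal softmax term."""
    target = 1 if label == 1 else 0  # assumed class order: [background, target]
    return logistic_loss(score, label) + lam * focal_softmax_loss(logits, target, gamma)


# A confidently correct sample contributes far less focal loss than a hard one.
easy = focal_softmax_loss([-3.0, 3.0], target=1)
hard = focal_softmax_loss([3.0, -3.0], target=1)
print(easy < hard)
```

With `gamma=0` the focal term reduces to ordinary softmax cross-entropy, which is one way to see how the focusing factor shifts the emphasis toward difficult samples.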