Cross-scale content-based full Transformer network with Bayesian inference for object tracking
Shenghua Fan, Xi Chen, Chu He, Yan Huang, Kehan Chen
DOI: https://doi.org/10.1007/s11042-022-14162-7
IF: 2.577
2022-11-25
Multimedia Tools and Applications
Abstract: Visual tracking is fundamentally a problem of regressing the conditional probability of the target location in each video frame. Convolutional neural networks (CNNs) have dominated visual tracking in recent years, but CNN-based trackers neglect long-range dependencies in the likelihood representation and the prior information, which destroys the spatial consistency of the target. Recently emerging Transformer-based trackers mitigate these problems; however, they cannot build interactions among cross-scale features. Moreover, the sine position encoding used as a prior in Transformer-based trackers is content-unaware and fails to reflect the relative index of different positions. To address these issues, and inspired by the Bayesian probabilistic formulation, we propose a cross-scale full Transformer tracker with a content-based prior bias (named BTT). The method makes four main contributions: (i) We propose a hierarchical full Transformer tracking architecture that introduces long-range dependency, which enriches the likelihood representation of the model and alleviates the destruction of spatial consistency. (ii) We propose an expanding layer that uses neither convolution nor interpolation to aggregate layer information from different scales and construct a cross-scale likelihood estimate. (iii) We demonstrate the defect of sine position encoding through mathematical derivation, and introduce a content-based positional encoding bias as the prior in the Transformer architecture to reflect the relative index of the inputs. (iv) Extensive experiments show that the proposed tracker outperforms CNN-based trackers under illumination variation, low resolution, and deformation on various datasets, and achieves superior performance on other attributes. The proposed tracker obtains 70.3%, 69.1%, and 63.4% on OTB2015, UAV123, and LaSOT, respectively.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
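The abstract's point that the sine position encoding is content-unaware can be illustrated with a small NumPy sketch. This is not the paper's derivation, only a well-known property of the standard 1-D sinusoidal encoding: the encoding is computed from position indices alone, and the raw dot product between two encodings (before any learned projection) depends only on the absolute offset between them, so offsets of +k and -k score identically. The function name `sine_position_encoding` and the sizes used are illustrative assumptions.

```python
import numpy as np

def sine_position_encoding(num_positions, dim, base=10000.0):
    """Standard 1-D sinusoidal encoding; dim is assumed to be even.

    Note that no image or token features enter this function:
    the encoding is determined by the position index alone.
    """
    positions = np.arange(num_positions)[:, None]        # shape (P, 1)
    freqs = base ** (-np.arange(0, dim, 2) / dim)        # shape (dim/2,)
    angles = positions * freqs[None, :]                  # shape (P, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

pe = sine_position_encoding(num_positions=32, dim=64)

# Position-only similarity between every pair of positions.
# Analytically, pe[p] @ pe[q] = sum_i cos(w_i * (p - q)).
scores = pe @ pe.T

# Pairs with the same offset get the same score, wherever they sit.
same_offset = np.isclose(scores[2, 7], scores[20, 25])

# The score is even in the offset: "5 to the right" and "5 to the left"
# cannot be told apart from this term.
symmetric = np.isclose(scores[2, 7], scores[7, 2])

print(same_offset, symmetric)
```

In a real Transformer the learned query and key projections change these scores, so the sketch only shows the behaviour of the encoding itself, which is the kind of limitation a content-based bias is meant to address.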