Optimizing intrinsic representation for tracking

Yuanming Zhang, Hao Sun
DOI: https://doi.org/10.1016/j.knosys.2024.111955
IF: 8.139
2024-05-23
Knowledge-Based Systems
Abstract: Learning target representations dedicated to tracking tasks is a promising direction in visual object tracking. Current state-of-the-art approaches train these representations through the pixel-level appearance reconstruction of auto-encoders. Such pixel-level approaches are redundant with respect to the high-level semantics that tracking requires, and they suffer from an optimization gap: better appearance reconstruction does not correspond to better tracking performance. This article explores the problem of reconstruction misalignment of target representations in Siamese network architectures. We propose an optimization framework for intrinsic representation that restores the high-level feature loss between the networks on the reconstruction side and the target side. This intrinsic representation is generated by feeding the masked image tokens to the target-side branch. Unlike pixel-level approaches, our approach learns to reconstruct the semantic latent-level representation of tracked objects. Moreover, we investigate feeding strategies for masked patches from the information bottleneck perspective, aiming to improve masking performance while reducing computational complexity. Additionally, robust propagation mechanisms for the weight parameters of the two-sided networks are explored. Comprehensive evaluations on seven tracking benchmarks demonstrate the effectiveness of our approach: GOT-10k, LaSOT, TrackingNet, NFS, UAV123, TNL2K, and OTB100. Our tracker outperforms the previous best pixel-based tracker on all seven datasets, by up to 5.9% absolute in average overlap, while running at 109 frames per second on a Graphics Processing Unit (GPU).
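The contrast between pixel-level and latent-level reconstruction can be illustrated with a minimal sketch. The abstract does not specify the loss, the masking strategy, or the weight propagation rule, so everything below is an assumption made for illustration: a mean squared error over masked tokens, uniform random masking, and an exponential moving average (EMA) as one possible way to propagate weights from the reconstruction side to the target side. The function names `random_mask`, `latent_reconstruction_loss`, and `ema_update` are hypothetical and do not come from the paper.

```python
import numpy as np


def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean array where True marks a masked token (assumed uniform masking)."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask


def latent_reconstruction_loss(predicted, target, mask):
    """Mean squared error between latent features, computed on masked tokens only.

    `predicted` stands in for reconstruction-side features and `target` for
    target-side features; both have shape (num_tokens, feature_dim).
    """
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))


def ema_update(target_weights, online_weights, momentum=0.99):
    """Assumed propagation rule: move target-side weights toward reconstruction-side weights."""
    return momentum * target_weights + (1.0 - momentum) * online_weights


# Toy usage with made-up shapes: 16 patch tokens, 8-dimensional latent features.
rng = np.random.default_rng(0)
mask = random_mask(16, 0.75, rng)
predicted = rng.normal(size=(16, 8))
target = rng.normal(size=(16, 8))
loss = latent_reconstruction_loss(predicted, target, mask)
new_target_weights = ema_update(np.zeros((8, 4)), np.ones((8, 4)), momentum=0.5)
```

The point of the sketch is only that the loss is measured between feature vectors rather than between pixel values; a pixel-level variant would compare reconstructed and original image patches instead.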
computer science, artificial intelligence