MIMTracking: Masked image modeling enhanced vision transformer for visual object tracking

Shuo Zhang, Dan Zhang, Qi Zou
DOI: https://doi.org/10.1016/j.neucom.2024.128415
IF: 6
2024-08-19
Neurocomputing
Abstract: Recently, one-stage trackers have achieved state-of-the-art tracking results owing to a more thorough integration of search and template representations. These methods usually adopt an encoder for synchronous feature generation and interaction. Despite their high performance, one-stage approaches tend to feed the encoder full input representations that are dense and highly redundant during training. To tackle this problem, a novel algorithm termed MIMTracking is developed for visual target tracking. MIMTracking exploits an encoder and a decoder for masked image modeling during training. Randomly sampled discrete search embeddings and template embeddings serve as the input of the encoder. The lightweight decoder takes full representations as input and progressively highlights the target region. This design alleviates input redundancy and reduces the computational cost of training, thereby allowing useful representations to be learned more efficiently. The proposed MIMTracking achieves state-of-the-art results on numerous tracking datasets, e.g., 51.5% area under curve (AUC) on LaSOT_ext, outperforming the previous top tracker OSTrack by 2%. In particular, our large model MIMTracking-L further improves the AUC to 53.4% on LaSOT_ext.
computer science, artificial intelligence
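The masked-input idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the keep ratio, the token counts, the `[MASK]` placeholder, the identity "encoder", and both helper names are hypothetical choices made only for illustration. The abstract does not say whether template embeddings are also subsampled; the sketch samples both, and a template keep ratio of 1.0 would keep the template whole.

```python
import random


def sample_visible_tokens(num_tokens, keep_ratio, rng):
    """Return sorted indices of the tokens kept visible for the encoder."""
    num_keep = max(1, int(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


def restore_full_sequence(visible_tokens, visible_idx, num_tokens, mask_token):
    """Put visible tokens back in place and fill dropped positions with a mask token."""
    full = [mask_token] * num_tokens
    for idx, tok in zip(visible_idx, visible_tokens):
        full[idx] = tok
    return full


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical patch tokens; strings stand in for embedding vectors.
search_tokens = [f"s{i}" for i in range(16)]
template_tokens = [f"t{i}" for i in range(4)]

# Assumption: keep half of each token set during training.
search_idx = sample_visible_tokens(len(search_tokens), 0.5, rng)
template_idx = sample_visible_tokens(len(template_tokens), 0.5, rng)

# The encoder only sees the sampled (sparse) tokens.
encoder_input = ([template_tokens[i] for i in template_idx]
                 + [search_tokens[i] for i in search_idx])

# Stand-in for the encoder: identity mapping (a real model would transform the tokens).
encoded = list(encoder_input)
encoded_search = encoded[len(template_idx):]

# The decoder receives a full-length search sequence with mask tokens at dropped positions.
decoder_input = restore_full_sequence(
    encoded_search, search_idx, len(search_tokens), "[MASK]"
)
```

Under these assumptions the encoder processes 10 tokens instead of 20, which is the kind of saving the abstract attributes to reduced input redundancy, while the decoder still operates on all 16 search positions.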