PromptVT: Prompting for Efficient and Accurate Visual Tracking

Minghua Zhang, Qiuyang Zhang, Wei Song, Dongmei Huang, Qi He
DOI: https://doi.org/10.1109/tcsvt.2024.3376582
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: While existing lightweight visual trackers can run in real time on edge devices, they struggle with changes in object appearance. An effective solution to this problem is to add an online updatable dynamic template so that trackers can learn about changes in target appearance over time. However, existing dynamic template utilization methods are unsuitable for lightweight networks, resulting in limited accuracy improvement and a significant increase in computational workload. In this paper, we propose PromptVT, an efficient and accurate video tracking framework, which consists of two important designs: a plug-and-play dynamic template prompter (DTP) and a hierarchical multi-scale transformer (HMT). The DTP module guides networks to effectively learn changes between initial and dynamic templates through two prompts without additional computational workload. The HMT module combines spatial features of the search area and template at different scales and levels, enabling the tracker to learn a more comprehensive visual representation. Our proposed PromptVT outperforms state-of-the-art real-time trackers on eight benchmarks (VOT2020, LaSOT, GOT-10K, UAV123, AntiUAV, AntiUAV410, TrackingNet, OTB100) while running at 52 fps (PyTorch model) and 76 fps (ONNX model) on CPUs, with only 2.9G FLOPs and 3M parameters. Code and models are available at https://github.com/faicaiwawa/PromptVT.
engineering, electrical & electronic