Transformer Vision-Language Tracking via Proxy Token Guided Cross-Modal Fusion

Haojie Zhao, Xiao Wang, Dong Wang, Huchuan Lu, Xiang Ruan
DOI: https://doi.org/10.1016/j.patrec.2023.02.023
IF: 4.757
2023-02-26
Pattern Recognition Letters
Abstract: Tracking by vision-language is an emerging topic. Previous researchers mainly adopt CNNs and sequential models for video and language encoding; however, their methods are limited by poor generalization performance. To address this problem, this paper presents a novel vision-language tracking framework based on the Transformer. Specifically, the proposed framework consists of an image encoder, a language encoder, a cross-modal fusion module, and task-specific heads. We adopt a residual network and BERT for image and language embedding, respectively. More importantly, we propose a proxy token guided cross-modal fusion module based on the Transformer network, which links the vision and language features effectively and efficiently. The proxy token acts as a proxy for the word embeddings and interacts with the visual feature. After absorbing vision information, the proxy token is used to modulate the word embeddings and make them attend to the visual feature. Finally, we obtain the fused features via a dynamic modal aggregation method and feed them into the task-specific heads for tracking. Extensive experiments demonstrate that our method sets a new state of the art on multiple language-assisted tracking datasets, including OTB-LANG, LaSOT, TNL2K, and a newly proposed Ref-LTB50 annotated with dense language specifications. The source code of this paper will be made publicly available.
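The fusion steps described in the abstract can be illustrated with a rough sketch. The abstract does not give the exact operations, so everything below is an assumption made for illustration: the function name `proxy_guided_fusion`, the single-head attention, the sigmoid gate used to modulate the word embeddings, and the scalar weighting used for aggregation are all hypothetical and are not the authors' implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attend(query, keys, values):
    # Single-head scaled dot-product attention for one query vector.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def proxy_guided_fusion(proxy, words, visual):
    # Step 1: the proxy token absorbs vision information
    # (residual cross-attention over the visual tokens).
    ctx = attend(proxy, visual, visual)
    proxy = [p + c for p, c in zip(proxy, ctx)]

    # Step 2 (assumed form): the vision-aware proxy modulates every
    # word embedding through a channel-wise sigmoid gate.
    gate = [sigmoid(p) for p in proxy]
    words = [[w_i * g for w_i, g in zip(w, gate)] for w in words]

    # Step 3: the modulated word embeddings attend to the visual feature.
    words = [[w_i + c for w_i, c in zip(w, attend(w, visual, visual))]
             for w in words]

    # Step 4 (assumed form): dynamic aggregation. Each visual token gathers
    # a language context and mixes it in with a data-dependent weight.
    fused = []
    for v in visual:
        lang = attend(v, words, words)
        alpha = sigmoid(dot(v, lang) / math.sqrt(len(v)))
        fused.append([alpha * v_i + (1.0 - alpha) * l_i
                      for v_i, l_i in zip(v, lang)])
    return fused

# Toy inputs: a 2-d proxy token, two word embeddings, three visual tokens.
fused = proxy_guided_fusion(
    [0.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
print(len(fused), len(fused[0]))
```

In this sketch the output keeps one fused vector per visual token, which is the shape a tracking head operating on a spatial feature map would typically expect; a real implementation would use learned projections and multi-head attention rather than raw dot products.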
computer science, artificial intelligence