Cross-modal Target Retrieval for Tracking by Natural Language

Yihao Li, Jun Yu, Zhongpeng Cai, Yuwen Pan
DOI: https://doi.org/10.1109/cvprw56347.2022.00540
2022-01-01
Abstract: Tracking by natural language specification in a video is a challenging task in computer vision. Unlike initializing the target state with only a bounding box in the first frame, a language specification has strong potential to help visual object trackers capture appearance variation and eliminate semantic ambiguity about the tracked object. In this paper, we design a unified local-global-search framework from the perspective of cross-modal retrieval, comprising a local tracker, an adaptive retrieval switch module, and a target-specific retrieval module. The adaptive retrieval switch module aligns semantics between the visual signal and the language description of the target using three sub-modules, i.e., object-aware attention memory, part-aware cross-attention, and vision-language contrast, which together achieve an automatic switch between local search and global search. When the global search mechanism is triggered, the target-specific retrieval module re-localizes the missing target across the whole image via an efficient vision-language guided proposal selector and target-text match. Experimental results on three prevailing benchmarks show the effectiveness and generalization of our framework.
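The local-global-search control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (track_sequence, local_tracker, match_score, propose), the single scalar score, and the fixed switch_threshold are all assumptions made for illustration. In the paper the switch decision comes from learned sub-modules and the proposals from a vision-language guided selector; here they are replaced by plain callables supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class TrackStep:
    box: Box
    used_global_search: bool


def track_sequence(
    frames: Sequence[object],
    init_box: Box,
    text: str,
    local_tracker: Callable[[object, Box], Box],
    match_score: Callable[[object, Box, str], float],
    propose: Callable[[object], List[Box]],
    switch_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> List[TrackStep]:
    """Track a target through `frames`, falling back to image-wide retrieval.

    All callables are hypothetical stand-ins:
      local_tracker(frame, prev_box) -> box found near the previous position
      match_score(frame, box, text)  -> vision-language agreement, higher is better
      propose(frame)                 -> candidate boxes over the whole image
    """
    results: List[TrackStep] = []
    box = init_box
    for frame in frames:
        # Step 1: local search around the previous target position.
        candidate = local_tracker(frame, box)
        used_global = False

        # Step 2: switch decision. A low score is taken to mean that the
        # local result no longer matches the language description.
        if match_score(frame, candidate, text) < switch_threshold:
            # Step 3: global search. Keep the proposal that best matches the text.
            proposals = propose(frame)
            if proposals:
                candidate = max(
                    proposals, key=lambda b: match_score(frame, b, text)
                )
                used_global = True

        box = candidate
        results.append(TrackStep(box=box, used_global_search=used_global))
    return results
```

In this sketch, local search runs on every frame and the image-wide retrieval runs only when the score drops below the threshold, which mirrors the switch between local and global search described in the abstract. The thresholded scalar is a simplification chosen to keep the example short.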