Language-guided Dual-modal Local Correspondence for Single Object Tracking
Jun Yu, Zhongpeng Cai, Yihao Li, Lei Wang, Fang Gao, Ye Yu
DOI: https://doi.org/10.1109/tmm.2024.3410141
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: This paper focuses on advancing single object tracking in computer vision, which has broad applications including robotic vision, video surveillance, and sports video analysis. Current methods that rely solely on the target's initial visual information encounter performance bottlenecks and limited applicability, because appearance features carry little target semantics and the target's appearance changes continuously. To address these issues, we propose a visual-language dual-modal approach to single object tracking that leverages natural language descriptions to enrich the semantic information of the moving target. We introduce a dual-modal single object tracking algorithm based on local correspondence modeling. The algorithm decomposes visual features into multiple local visual semantic features and pairs them with local language features extracted from natural language descriptions. We also propose a new global relocalization method that uses visual-language dual-modal information to detect target disappearance and misalignment and to adaptively relocate the target within the entire image. This improves the tracker's ability to adapt to changes in target appearance over long periods, enabling long-term single object tracking based on dual-modal semantic and motion information. Experimental results show that our model outperforms state-of-the-art methods, which demonstrates the effectiveness and efficiency of our approach.
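The local correspondence idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how local features are extracted, matched, or scored, so the cosine-similarity matching, the function names (`match_local_features`, `needs_relocalization`), and the `threshold` value below are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_local_features(visual_parts, language_parts):
    """Pair each local language feature with its most similar local visual feature.

    Assumption: both modalities are already embedded in a shared space.
    Returns a list of (language_index, visual_index, similarity) tuples.
    """
    pairs = []
    for j, lang in enumerate(language_parts):
        scores = [cosine(vis, lang) for vis in visual_parts]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((j, best, scores[best]))
    return pairs


def needs_relocalization(pairs, threshold=0.5):
    """Hypothetical trigger: flag possible target loss when the mean
    matching score falls below an assumed threshold."""
    if not pairs:
        return True
    mean_score = sum(score for _, _, score in pairs) / len(pairs)
    return mean_score < threshold


# Toy example: three local visual features, two local language features.
visual_parts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
language_parts = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
pairs = match_local_features(visual_parts, language_parts)
lost = needs_relocalization(pairs)
```

In this toy setup, a low mean similarity between the language parts and their best visual matches stands in for the "target disappearance and misalignment" signal; a real tracker would presumably combine it with motion cues and search the whole image when the trigger fires.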