Multi-modal Understanding and Generation for Object Tracking

Hong Zhu, Pingping Zhang, Lei Xue, Guanglin Yuan
DOI: https://doi.org/10.1109/tcsvt.2024.3510735
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Vision-Language Tracking (VLT) aims to predict the target state in video sequences using two types of heterogeneous information: 1) a static text description detailing the main characteristics of the tracked object, and 2) dynamic image patches containing the target and its surroundings. However, as tracking proceeds, inconsistencies may arise between the linguistic information embedded in the text description and the visual representations stored in the search images. In such cases, directly fusing vision and language can result in conflicts. To tackle this issue, we propose MugTracker, which integrates image-to-text generation into the VLT framework and uses generative updating to mitigate the effects of these inconsistencies. Specifically, we design two branch tasks: multi-modal understanding for reasoning and multi-modal generation for updating. We develop a dynamic text generator based on the hybrid architecture of the pre-trained foundation model BLIP and adaptively update the text reference as the context varies for more accurate target modeling. The semantically consistent visual and linguistic representations are then aligned and associated by the reasoning branch, built on the BLIP dual-encoder, to infer the target state. To better transfer the foundation model into a strong tracker, we introduce a TE-Adapter in the visual components for target enhancement and a Text-Adapter in the linguistic components to strengthen the learning of discriminative semantics. MugTracker has been extensively evaluated on three datasets, and its superior performance over state-of-the-art methods demonstrates its effectiveness.