Abstract: Visual Language Tracking (VLT) enhances tracking by using high-level semantic information from language to mitigate the limitations of relying solely on the visual modality. Integrating language also enables more advanced human-machine interaction. The essence of interaction is cognitive alignment, which typically requires multiple information exchanges, especially in the sequential decision-making process of VLT. However, current VLT benchmarks do not account for multi-round interaction during tracking: they provide only an initial text and bounding box (bbox) in the first frame, with no further interaction as tracking progresses, which deviates from the original motivation of the VLT task. To address these limitations, we propose a novel and robust benchmark, VLT-MI (Visual Language Tracking with Multi-modal Interaction), which introduces multi-round interaction into the VLT task for the first time. (1) We generate diverse, multi-granularity texts for multi-round, multi-modal interaction from existing mainstream VLT benchmarks using DTLLM-VLT, leveraging the world knowledge of LLMs. (2) We propose a new VLT interaction paradigm that achieves multi-round interaction through text updates and object recovery. When multiple tracking failures occur, the tracker receives better-aligned texts and corrected bboxes through interaction, which expands the scope of VLT downstream tasks. (3) We conduct comparative experiments on both traditional VLT benchmarks and VLT-MI, evaluating and analyzing the accuracy and robustness of trackers under the interactive paradigm. This work offers new insights and paradigms for the VLT task and enables fine-grained evaluation of multi-modal trackers. We believe this approach can be extended to additional datasets in the future, supporting broader evaluation and comparison of video-language model capabilities.
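
The interaction paradigm in (2) can be illustrated with a minimal sketch. The abstract does not specify how a failure is detected or how many failures trigger an interaction, so everything below is an assumption made for illustration: the `predict` callback, the IoU-based failure test, the `iou_thresh` and `max_failures` parameters, and the choice to count consecutive failures are all hypothetical and are not the benchmark's actual protocol.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def run_with_interaction(
    predict: Callable[[int, Box, str], Box],  # hypothetical tracker step
    gt_boxes: Sequence[Box],
    texts: Sequence[str],
    iou_thresh: float = 0.5,   # assumed failure criterion
    max_failures: int = 3,     # assumed number of failures before interacting
) -> Tuple[List[Box], List[int]]:
    """Track a sequence, interacting after repeated consecutive failures.

    An interaction hands the tracker the corrected bbox and the text for
    the current frame (object recovery plus text update).
    """
    box, text = gt_boxes[0], texts[0]  # standard first-frame initialization
    predictions: List[Box] = [box]
    interactions: List[int] = []
    failures = 0

    for t in range(1, len(gt_boxes)):
        box = predict(t, box, text)
        predictions.append(box)

        # Count consecutive frames whose overlap falls below the threshold.
        if iou(box, gt_boxes[t]) < iou_thresh:
            failures += 1
        else:
            failures = 0

        if failures >= max_failures:
            # Interaction: recover the object and update the text.
            box, text = gt_boxes[t], texts[t]
            interactions.append(t)
            failures = 0

    return predictions, interactions
```

With a tracker that never moves its box and a target that shifts every frame, this loop would trigger an interaction every `max_failures` frames, which is the behavior traditional single-initialization benchmarks cannot express.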

Visual and Language Collaborative Learning for RGBT Object Tracking

Temporal Adaptive RGBT Tracking with Modality Prompt

Multi-features Guided Robust Visual Tracking

TLPG-Tracker: Joint Learning of Target Localization and Proposal Generation for Visual Tracking

Review and Analysis of RGBT Single Object Tracking Methods: A Fusion Perspective

Unifying Visual and Vision-Language Tracking via Contrastive Learning

Exploring fusion strategies for accurate RGBT visual object tracking

RGB-T Tracking with Template-Bridged Search Interaction and Target-Preserved Template Updating

RGBT Tracking via Challenge-Based Appearance Disentanglement and Interaction

Visible and Infrared Object Tracking Based on Multimodal Hierarchical Relationship Modeling

RGB-T Object Tracking: Benchmark and Baseline

Divert More Attention to Vision-Language Object Tracking

Breaking Modality Gap in RGBT Tracking: Coupled Knowledge Distillation

SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking

Visual Language Tracking with Multi-modal Interaction: A Robust Benchmark

RGBT tracking based on cooperative low-rank graph model

Visual Prompt Multi-Modal Tracking

Unsupervised RGB-T object tracking with attentional multi-modal feature fusion

Real-time Visual Object Tracking with Natural Language Description

Visual Object Tracking Via Guessing and Matching

DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM