Abstract: Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks, including three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We offer four granularity texts in our benchmark, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance and hope the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding.
The proposed benchmark, experimental results and toolkit will be released gradually at http://videocube.aitestunion.com/.

What problem does this paper attempt to address?

This paper targets several key limitations of existing Visual Language Tracking (VLT) benchmarks:

1. **Single-granularity text annotation**: Most current VLT benchmarks rely on short, manually annotated text descriptions, which often fail to capture the dynamic changes and semantic diversity of video content. This single-granularity annotation limits an algorithm's in-depth understanding of video content.
2. **Limitations of text descriptions**: Existing text annotations usually focus on the first frame of the video and lack continuous spatio-temporal information. As a result, algorithms easily fall into a "memorize the answer" mode rather than truly understanding the video content.
3. **Inefficiency of manual annotation**: Manually producing high-quality, multi-granularity text annotations for large-scale datasets is time-consuming and resource-intensive, which makes it difficult to meet research requirements.
4. **Performance bottlenecks of existing algorithms**: Existing VLT algorithms perform poorly when faced with diverse text descriptions, and their performance drops significantly on unseen texts. This indicates that current algorithms generalize poorly.

To solve these problems, the authors propose a new multi-modal benchmark, DTVLT (Diverse Text Visual Language Tracking), and introduce DTLLM-VLT, a text generation method based on large language models (LLMs). With this method, the authors generate diverse text descriptions for five mainstream VLT and SOT (Single Object Tracking) benchmarks, covering semantic information at different granularities. Specifically, they provide text descriptions at four granularities: initial concise, initial detailed, dense concise, and dense detailed. These diverse text descriptions support a more comprehensive evaluation of VLT algorithms, help improve their performance, and promote research progress in video understanding and multi-modal learning.

### Main contributions:

1. **Propose a new VLT benchmark**: Construct a new VLT benchmark named DTVLT, which covers five mainstream VLT and SOT benchmarks and includes three tracking tasks: short-term tracking, long-term tracking, and global instance tracking.
2. **Provide multi-granularity text**: Generate high-quality text descriptions at four granularities through the DTLLM-VLT method, enriching the semantic information.
3. **Comprehensive experimental analysis**: Conduct extensive experimental analysis, evaluate the impact of diverse texts on tracking performance, and reveal the performance bottlenecks of existing algorithms, providing directions for future research.

Through these contributions, the authors hope to provide a more flexible and comprehensive environment for VLT and video understanding research.
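
To make the four text granularities concrete, the following is a minimal sketch of how per-sequence annotations could be stored and queried. It is not the DTVLT toolkit's format: the names `Granularity`, `SequenceTexts`, `add`, and `text_at` are hypothetical, the example descriptions and the 100-frame spacing are invented for illustration, and the sketch assumes that "initial" texts attach only to the first frame (index 0) while "dense" texts attach to any number of frames.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Granularity(Enum):
    """The four text granularities named in the summary (identifiers are hypothetical)."""
    INITIAL_CONCISE = "initial_concise"
    INITIAL_DETAILED = "initial_detailed"
    DENSE_CONCISE = "dense_concise"
    DENSE_DETAILED = "dense_detailed"

    @property
    def is_dense(self) -> bool:
        return self in (Granularity.DENSE_CONCISE, Granularity.DENSE_DETAILED)


@dataclass
class SequenceTexts:
    """Text annotations for one video sequence (hypothetical container)."""
    name: str
    # Maps granularity -> {frame index -> description}.
    texts: Dict[Granularity, Dict[int, str]] = field(default_factory=dict)

    def add(self, granularity: Granularity, frame: int, text: str) -> None:
        # Assumption: initial descriptions exist only for the first frame.
        if not granularity.is_dense and frame != 0:
            raise ValueError("initial descriptions attach to frame 0 only")
        self.texts.setdefault(granularity, {})[frame] = text

    def text_at(self, granularity: Granularity, frame: int) -> Optional[str]:
        """Return the most recent description at or before `frame`, if any."""
        per_frame = self.texts.get(granularity, {})
        eligible = [f for f in per_frame if f <= frame]
        if not eligible:
            return None
        return per_frame[max(eligible)]


# Usage with made-up descriptions and a made-up annotation spacing.
seq = SequenceTexts("demo_sequence")
seq.add(Granularity.INITIAL_CONCISE, 0, "a white dog")
seq.add(Granularity.DENSE_CONCISE, 0, "a white dog")
seq.add(Granularity.DENSE_CONCISE, 100, "a white dog running on grass")

print(seq.text_at(Granularity.INITIAL_CONCISE, 150))  # a white dog
print(seq.text_at(Granularity.DENSE_CONCISE, 150))    # a white dog running on grass
```

The lookup returns the latest description at or before the queried frame, so a tracker given dense text sees updated semantics as the video progresses, while one given initial text keeps the first-frame description throughout.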

DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM

DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM

Visual Language Tracking with Multi-modal Interaction: A Robust Benchmark

How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking

Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

VastTrack: Vast Category Visual Object Tracking

Divert More Attention to Vision-Language Object Tracking

Unifying Visual and Vision-Language Tracking via Contrastive Learning

LVBench: An Extreme Long Video Understanding Benchmark

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

OVT-B: A New Large-Scale Benchmark for Open-Vocabulary Multi-Object Tracking

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

LaMOT: Language-Guided Multi-Object Tracking

DSText V2: A Comprehensive Video Text Spotting Dataset for Dense and Small Text

A Multi-modal Global Instance Tracking Benchmark (MGIT): Better Locating Target in Complex Spatio-temporal and Causal Relationship

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI