Abstract: Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks and covering three sub-tasks: short-term tracking, long-term tracking, and global instance tracking; (2) offer texts at four granularities in our benchmark, considering the extent and density of semantic information, and expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research; and (3) conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance, and hope that the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding.
The proposed benchmark, experimental results, and toolkit will be released gradually at http://videocube.aitestunion.com/

A Multi-modal Global Instance Tracking Benchmark (MGIT): Better Locating Target in Complex Spatio-temporal and Causal Relationship

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

Multi-features Guided Robust Visual Tracking

Global Instance Tracking: Locating Target More Like Humans

MLGT: multi-local guided tracker for visual object tracking

Visual Language Tracking with Multi-modal Interaction: A Robust Benchmark

Beyond SOT: Tracking Multiple Generic Objects at Once

Towards Generalizable Multi-Object Tracking

DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM

Multi-cue Based Multi-target Tracking with Boosted MHT

Multi-Cue Based Tracking

MATI: Multimodal Adaptive Tracking Integrator for Robust Visual Object Tracking

MCTrack: A Unified 3D Multi-Object Tracking Framework for Autonomous Driving

Multi-Timescale Collaborative Tracking

Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking

CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking and Segmentation

MM-Tracker: Visual Tracking with A Multi-Task Model Integrating Detection and Differentiating Feature Extraction

Awesome Multi-modal Object Tracking

GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking

Towards Effective Multi-Moving-Camera Tracking: A New Dataset and Lightweight Link Model

MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking