Abstract: Visual Object Tracking (VOT) is an attractive and significant research area in computer vision, which aims to recognize and track specific targets in video sequences where the target objects are arbitrary and class-agnostic. The VOT technology could be applied in various scenarios, processing data of diverse modalities such as RGB, thermal infrared and point cloud. Besides, since no one sensor could handle all the dynamic and varying environments, multi-modal VOT is also investigated. This paper presents a comprehensive survey of the recent progress of both single-modal and multi-modal VOT, especially the deep learning methods. Specifically, we first review three types of mainstream single-modal VOT, including RGB, thermal infrared and point cloud tracking. In particular, we conclude four widely-used single-modal frameworks, abstracting their schemas and categorizing the existing inheritors. Then we summarize four kinds of multi-modal VOT, including RGB-Depth, RGB-Thermal, RGB-LiDAR and RGB-Language. Moreover, the comparison results in plenty of VOT benchmarks of the discussed modalities are presented. Finally, we provide recommendations and insightful observations, inspiring the future development of this fast-growing literature.

What problem does this paper attempt to address?

This paper addresses the challenges of Visual Object Tracking (VOT) across different data modalities. VOT is an important research area in computer vision that aims to identify and track specific targets in video sequences, where the targets can be arbitrary and of unknown category. VOT can be applied in a variety of scenarios and can process data from different modalities, such as RGB, thermal infrared, and point cloud.

### Main Problems and Challenges

1. **Complex Appearance Changes**: The target may undergo deformation, rotation, scale change, and motion blur, or may move out of view.
2. **Background Interference**: Illumination change, distraction from similar objects, occlusion, and cluttered backgrounds all affect tracking.
3. **Sensor Movement**: The video-capture sensor may shake or move, further increasing the difficulty of tracking.

### The Need for Multi-Modal Fusion

Since a single sensor cannot cope with all dynamic and changing environments, multi-modal VOT has become a research hotspot. For example:

- **RGB + Depth**: Helps with occlusion reasoning and provides geometric information.
- **RGB + Thermal**: Works at night and in bad weather, and is not sensitive to similar textures or dark backgrounds.
- **RGB + LiDAR**: Provides rich 3D geometric and depth information, and suits applications that require accurate 3D information, such as autonomous driving.
- **RGB + Language**: Expresses the semantic information of the target object more intuitively, and suits human-computer interaction.

### Main Contributions of the Paper

1. **Comprehensive Review**: Systematically reviews single-modal and multi-modal VOT methods from the perspective of data modalities, covering the latest deep-learning-based methods.
2. **Framework Summary**: Summarizes four widely used single-modal DNN tracker frameworks, abstracting their patterns and categorizing their existing successors.
3. **Latest Progress**: Provides a comprehensive overview of more than 300 recent and advanced methods in the VOT field.
4. **Benchmark Comparison**: Presents extensive comparisons across modalities on common benchmarks, together with in-depth discussion and future research directions.

Together, these contributions provide reference and guidance for research on visual object tracking.

Visual Object Tracking across Diverse Data Modalities: A Review
