Visual Object Tracking across Diverse Data Modalities: A Review

Mengmeng Wang, Teli Ma, Shuo Xin, Xiaojun Hou, Jiazheng Xing, Guang Dai, Jingdong Wang, Yong Liu
2024-12-13
Abstract: Visual Object Tracking (VOT) is an attractive and significant research area in computer vision, which aims to recognize and track specific targets in video sequences where the target objects are arbitrary and class-agnostic. The VOT technology could be applied in various scenarios, processing data of diverse modalities such as RGB, thermal infrared and point cloud. Besides, since no one sensor could handle all the dynamic and varying environments, multi-modal VOT is also investigated. This paper presents a comprehensive survey of the recent progress of both single-modal and multi-modal VOT, especially the deep learning methods. Specifically, we first review three types of mainstream single-modal VOT, including RGB, thermal infrared and point cloud tracking. In particular, we conclude four widely-used single-modal frameworks, abstracting their schemas and categorizing the existing inheritors. Then we summarize four kinds of multi-modal VOT, including RGB-Depth, RGB-Thermal, RGB-LiDAR and RGB-Language. Moreover, the comparison results in plenty of VOT benchmarks of the discussed modalities are presented. Finally, we provide recommendations and insightful observations, inspiring the future development of this fast-growing literature.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the challenges of Visual Object Tracking (VOT) across different data modalities. VOT is an important research area in computer vision that aims to identify and track specific targets in video sequences, where the targets can be arbitrary and of unknown category. VOT technology applies to a variety of scenarios and processes data from different modalities, such as RGB, thermal infrared, and point cloud.

### Main Problems and Challenges

1. **Complex Appearance Changes**: The target may undergo deformation, rotation, scale change, and motion blur, or move out of view.
2. **Background Interference**: Illumination change, distraction from similar objects, occlusion, and cluttered backgrounds all affect tracking.
3. **Sensor Movement**: The video-capture sensor may shake or move, further increasing the difficulty of tracking.

### The Need for Multi-Modal Fusion

Since a single sensor cannot cope with all dynamic and changing environments, multi-modal VOT has become a research hotspot. For example:

- **RGB + Depth**: Helps with occlusion reasoning and provides geometric information.
- **RGB + Thermal**: Works at night and in bad weather, and is not sensitive to similar textures and dark backgrounds.
- **RGB + LiDAR**: Provides rich 3D geometric and depth information, and suits applications that require accurate 3D information, such as autonomous driving.
- **RGB + Language**: Expresses the semantic information of the target object more intuitively, and suits human-computer interaction.

### Main Contributions of the Paper

1. **Comprehensive Review**: Systematically reviews single-modal and multi-modal VOT methods from the perspective of data modalities, covering the latest deep-learning-based methods.
2. **Framework Summary**: Summarizes four widely used single-modal DNN tracker frameworks, abstracting their patterns and categorizing their existing successors.
3. **Latest Progress**: Provides a comprehensive overview of more than 300 recent and advanced methods in the VOT field.
4. **Benchmark Comparison**: Presents wide-ranging comparisons of the various modalities on common benchmarks, along with in-depth discussion and future research directions.

Together, these contributions provide a reference and guidance for research in visual object tracking.