Awesome Multi-modal Object Tracking

Chunhui Zhang, Li Liu, Hao Wen, Xi Zhou, Yanfeng Wang
2024-05-31
Abstract: Multi-modal object tracking (MMOT) is an emerging field that combines data from various modalities, e.g., vision (RGB), depth, thermal infrared, event, language and audio, to estimate the state of an arbitrary object in a video sequence. It is of great significance for many applications such as autonomous driving and intelligent surveillance. In recent years, MMOT has received more and more attention. However, existing MMOT algorithms mainly focus on two modalities (e.g., RGB+depth, RGB+thermal infrared, and RGB+language). To leverage more modalities, some recent efforts have been made to learn a unified visual object tracking model for any modality. Additionally, some large-scale multi-modal tracking benchmarks have been established by simultaneously providing more than two modalities, such as vision-language-audio (e.g., WebUAV-3M) and vision-depth-language (e.g., UniMod1K). To track the latest progress in MMOT, we conduct a comprehensive investigation in this report. Specifically, we first divide existing MMOT tasks into five main categories, i.e., RGBL tracking, RGBE tracking, RGBD tracking, RGBT tracking, and miscellaneous (RGB+X), where X can be any modality, such as language, depth, and event. Then, we analyze and summarize each MMOT task, focusing on widely used datasets and mainstream tracking algorithms based on their technical paradigms (e.g., self-supervised learning, prompt learning, knowledge distillation, generative models, and state space models). Finally, we maintain a continuously updated paper list for MMOT at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper attempts to address several key issues in multimodal object tracking (MMOT):

1. **Limitations of existing algorithms**: Although RGB-based object tracking methods have made significant progress over the past 10 years, they still fail to achieve precise and robust tracking in some complex situations, such as illumination changes, fast motion, occlusion, and appearance variations.
2. **Insufficient utilization of multimodal data**: Existing MMOT algorithms mainly focus on the combination of two modalities (e.g., RGB+Depth, RGB+Thermal Infrared, RGB+Language) and fail to fully utilize more modal information to improve tracking performance.
3. **Lack of comprehensive reviews**: Most existing MMOT reviews focus on the combination of two modalities and lack a comprehensive survey of tracking tasks involving more modalities.

To address these issues, the paper undertakes the following work:

- **Classification and definition of different MMOT tasks**: The existing MMOT tasks are classified into 5 categories, namely RGBL tracking, RGBE tracking, RGBD tracking, RGBT tracking, and miscellaneous (RGB+X) tracking, and informal definitions for each type of task are provided.
- **Analysis and summary of mainstream algorithms**: For each type of MMOT task, the paper analyzes widely used datasets and the technical paradigms of mainstream tracking algorithms, such as self-supervised learning, prompt learning, knowledge distillation, generative models, and state-space models.
- **Providing the latest research progress**: A continuously updated list of MMOT papers is maintained to help researchers keep track of the latest developments in the field.

Through this work, the paper aims to provide researchers with a comprehensive perspective on the current research progress, major achievements, existing problems, and future development directions in the MMOT field.