Awesome Multi-modal Object Tracking

Chunhui Zhang, Li Liu, Hao Wen, Xi Zhou, Yanfeng Wang

2024-05-31

Abstract: Multi-modal object tracking (MMOT) is an emerging field that combines data from various modalities, e.g., vision (RGB), depth, thermal infrared, event, language and audio, to estimate the state of an arbitrary object in a video sequence. It is of great significance for many applications such as autonomous driving and intelligent surveillance. In recent years, MMOT has received more and more attention. However, existing MMOT algorithms mainly focus on two modalities (e.g., RGB+depth, RGB+thermal infrared, and RGB+language). To leverage more modalities, some recent efforts have been made to learn a unified visual object tracking model for any modality. Additionally, some large-scale multi-modal tracking benchmarks have been established by simultaneously providing more than two modalities, such as vision-language-audio (e.g., WebUAV-3M) and vision-depth-language (e.g., UniMod1K). To track the latest progress in MMOT, we conduct a comprehensive investigation in this report. Specifically, we first divide existing MMOT tasks into five main categories, i.e., RGBL tracking, RGBE tracking, RGBD tracking, RGBT tracking, and miscellaneous (RGB+X), where X can be any modality, such as language, depth, and event. Then, we analyze and summarize each MMOT task, focusing on widely used datasets and mainstream tracking algorithms based on their technical paradigms (e.g., self-supervised learning, prompt learning, knowledge distillation, generative models, and state space models). Finally, we maintain a continuously updated paper list for MMOT at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses several key issues in multi-modal object tracking (MMOT):

1. **Limitations of existing algorithms**: Although RGB-based object tracking methods have made significant progress over the past 10 years, they still fail to achieve precise and robust tracking in complex situations such as illumination changes, fast motion, occlusion, and appearance variations.
2. **Insufficient utilization of multi-modal data**: Existing MMOT algorithms mainly combine two modalities (e.g., RGB+Depth, RGB+Thermal Infrared, RGB+Language) and do not exploit additional modalities to improve tracking performance.
3. **Lack of comprehensive reviews**: Most existing MMOT reviews focus on combinations of two modalities and lack a comprehensive survey of tracking tasks involving more modalities.

To address these issues, the paper undertakes the following work:

- **Classification and definition of different MMOT tasks**: Existing MMOT tasks are classified into 5 categories, namely RGBL tracking, RGBE tracking, RGBD tracking, RGBT tracking, and miscellaneous (RGB+X) tracking, and an informal definition is provided for each type of task.
- **Analysis and summary of mainstream algorithms**: For each type of MMOT task, the paper analyzes widely used datasets and the technical paradigms of mainstream tracking algorithms, such as self-supervised learning, prompt learning, knowledge distillation, generative models, and state-space models.
- **Providing the latest research progress**: A continuously updated list of MMOT papers is maintained to help researchers keep track of the latest developments in the field.

Through this work, the paper aims to give researchers a comprehensive view of the current research progress, major achievements, existing problems, and future directions in the MMOT field.
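
The five-way task taxonomy can be pictured as a lookup from the set of input modalities to a category name. The following is a minimal Python sketch, not code from the paper: the names `Modality`, `NAMED_TASKS`, and `categorize` are hypothetical, and routing every other RGB-based combination (for example RGB+audio, or three modalities at once) to the miscellaneous RGB+X category is an assumption, since the summary does not state how the paper delimits that category.

```python
from enum import Enum


class Modality(Enum):
    """Modalities named in the abstract (hypothetical encoding)."""
    RGB = "rgb"
    LANGUAGE = "language"
    EVENT = "event"
    DEPTH = "depth"
    THERMAL = "thermal infrared"
    AUDIO = "audio"


# The four named two-modality tasks listed in the summary.
NAMED_TASKS = {
    "RGBL": frozenset({Modality.RGB, Modality.LANGUAGE}),
    "RGBE": frozenset({Modality.RGB, Modality.EVENT}),
    "RGBD": frozenset({Modality.RGB, Modality.DEPTH}),
    "RGBT": frozenset({Modality.RGB, Modality.THERMAL}),
}


def categorize(modalities):
    """Map a collection of modalities to one of the five task categories.

    Assumption: any RGB-based combination that is not one of the four
    named pairs falls into the miscellaneous "RGB+X" category.
    """
    mods = frozenset(modalities)
    if Modality.RGB not in mods or len(mods) < 2:
        raise ValueError("expected RGB plus at least one other modality")
    for name, required in NAMED_TASKS.items():
        if mods == required:
            return name
    return "RGB+X"


if __name__ == "__main__":
    print(categorize([Modality.RGB, Modality.THERMAL]))
    print(categorize([Modality.RGB, Modality.DEPTH, Modality.LANGUAGE]))
```

Under these assumptions, a vision-depth-language benchmark such as UniMod1K would be categorized as RGB+X, while a plain RGB+thermal dataset maps to RGBT.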

Online Multi-Object Tracking from A Bird's-Eye View by Fusion of Millimeter-Wave Radar and Vision

Multi-modal visual tracking: Review and experimental comparison

MAT: Motion-Aware Multi-Object Tracking

Multi-Granularity Language-Guided Multi-Object Tracking

Cross-Modal Object Tracking via Modality-Aware Fusion Network and a Large-Scale Dataset

Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking

Cross-Modal Object Tracking: Modality-Aware Representations and a Unified Benchmark

Object-Level Pseudo-3D Lifting for Distance-Aware Tracking

Deep learning and multi-modal fusion for real-time multi-object tracking: Algorithms, challenges, datasets, and comparative study

Exploring the State-of-the-Art in Multi-Object Tracking: A Comprehensive Survey, Evaluation, Challenges, and Future Directions

CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking and Segmentation

XTrack: Multimodal Training Boosts RGB-X Video Object Trackers

LaMOT: Language-Guided Multi-Object Tracking

MotionTrack: Learning Robust Short-term and Long-term Motions for Multi-Object Tracking

Yolo-3DMM for Simultaneous Multiple Object Detection and Tracking in Traffic Scenarios

Probabilistic 3D Multi-Modal, Multi-Object Tracking for Autonomous Driving

MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking

Multiple object tracking: A literature review

A Review of Detection-Related Multiple Object Tracking in Recent Times

SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking