Abstract: Visual object tracking often faces challenges such as invalid targets and decreased performance in low-light conditions when relying solely on RGB image sequences. While incorporating additional modalities such as depth and infrared data has proven effective, existing multimodal imaging platforms are complex and lack real-world applicability. In contrast, near-infrared (NIR) imaging is commonly used in surveillance cameras, which switch between RGB and NIR based on light intensity. However, tracking objects across these heterogeneous modalities poses significant challenges, particularly because no modality switch signal is available during tracking. To address these challenges, we propose an adaptive cross-modal object tracking algorithm called modality-aware fusion network (MAFNet). MAFNet integrates information from both RGB and NIR modalities using an adaptive weighting mechanism, bridging the appearance gap and enabling a modality-aware target representation. It consists of two key components: an adaptive weighting module and a modality-specific representation module. The adaptive weighting module predicts fusion weights to dynamically adjust the contribution of each modality, while the modality-specific representation module captures discriminative features specific to the RGB and NIR modalities. MAFNet is flexible and can be integrated into diverse tracking frameworks. Simple, effective, and efficient, MAFNet outperforms state-of-the-art methods in cross-modal object tracking. To validate the effectiveness of our algorithm and overcome the scarcity of data in this field, we introduce CMOTB, a large-scale benchmark dataset for cross-modal object tracking. CMOTB consists of 61 categories and 1000 video sequences, comprising a total of over 799K frames. We believe that our proposed method and dataset offer a strong foundation for advancing cross-modal object tracking research.
The dataset, toolkit, experimental data, and source code will be publicly available at: https://github.com/mmic-lcl/Datasets-and-benchmark-code.
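
To make the two-component idea described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each modality-specific branch is a single dense layer, that the weighting module is a dense layer followed by a softmax over two logits, and that fusion is a weighted sum; none of these choices is stated in the abstract. All names (`linear`, `modality_aware_fusion`, `gate`) and the toy parameter values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def modality_aware_fusion(x, rgb_branch, nir_branch, gate):
    """Fuse two modality-specific features with predicted weights.

    Assumption: the input frame feature `x` carries no modality label,
    so both branches always run and the gate decides how much each
    branch contributes.
    """
    f_rgb = linear(*rgb_branch, x)            # RGB-specific representation
    f_nir = linear(*nir_branch, x)            # NIR-specific representation
    w_rgb, w_nir = softmax(linear(*gate, x))  # fusion weights, sum to 1
    fused = [w_rgb * a + w_nir * b for a, b in zip(f_rgb, f_nir)]
    return fused, (w_rgb, w_nir)


# Toy parameters (hypothetical): each branch is a (weights, bias) pair.
rgb_branch = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
nir_branch = ([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0])
gate = ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])

x = [1.0, 0.0]  # stand-in for a shared backbone feature of one frame
fused, (w_rgb, w_nir) = modality_aware_fusion(x, rgb_branch, nir_branch, gate)
print(fused, w_rgb, w_nir)
```

Because the weights are predicted from the input feature itself, a sketch like this needs no explicit modality switch signal: a frame that looks more RGB-like simply receives a larger RGB weight.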

Cross-Modal Object Tracking via Modality-Aware Fusion Network and a Large-Scale Dataset
