Abstract: Visual object tracking that relies solely on RGB image sequences often suffers from invalid targets and degraded performance in low-light conditions. Incorporating additional modalities such as depth and infrared data has proven effective, but existing multimodal imaging platforms are complex and lack real-world applicability. In contrast, near-infrared (NIR) imaging is common in surveillance cameras, which switch between RGB and NIR according to light intensity. Tracking objects across these heterogeneous modalities is challenging, however, particularly because no modality switch signal is available during tracking. To address this, we propose an adaptive cross-modal object tracking algorithm called the modality-aware fusion network (MAFNet). MAFNet integrates information from the RGB and NIR modalities through an adaptive weighting mechanism, bridging the appearance gap and yielding a modality-aware target representation. It consists of two key components: an adaptive weighting module and a modality-specific representation module. The adaptive weighting module predicts fusion weights that dynamically adjust the contribution of each modality, while the modality-specific representation module captures discriminative features specific to the RGB and NIR modalities. MAFNet is flexible and can be integrated into diverse tracking frameworks. It is simple and efficient, and it outperforms state-of-the-art methods in cross-modal object tracking. To validate the algorithm and address the scarcity of data in this field, we introduce CMOTB, a large-scale benchmark dataset for cross-modal object tracking. CMOTB comprises 61 categories and 1000 video sequences, totaling over 799K frames. We believe the proposed method and dataset offer a strong foundation for advancing cross-modal object tracking research.
The dataset, toolkit, experimental data, and source code will be publicly available at https://github.com/mmic-lcl/Datasets-and-benchmark-code.
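
The adaptive weighting idea described in the abstract can be illustrated with a small sketch. The abstract does not say how the fusion weights are computed, so everything below is an assumption made for illustration: the function names, the linear gate (`gate_w`, `gate_b`), and the softmax over two branches are hypothetical and are not taken from the MAFNet source code. The two input vectors stand in for the outputs of an RGB-specific and an NIR-specific branch. Because no modality switch signal is available, the weights are predicted from the features themselves.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_fusion_weights(
    rgb_feat: Sequence[float],
    nir_feat: Sequence[float],
    gate_w: Sequence[Sequence[float]],
    gate_b: Sequence[float],
) -> List[float]:
    """Hypothetical gate: one linear layer on the concatenated branch
    features, followed by a softmax so the two weights sum to 1."""
    joint = list(rgb_feat) + list(nir_feat)
    scores = [
        sum(w * x for w, x in zip(row, joint)) + b
        for row, b in zip(gate_w, gate_b)
    ]
    return softmax(scores)


def fuse(
    rgb_feat: Sequence[float],
    nir_feat: Sequence[float],
    weights: Sequence[float],
) -> List[float]:
    """Weighted sum of the two modality-specific feature vectors."""
    w_rgb, w_nir = weights
    return [w_rgb * r + w_nir * n for r, n in zip(rgb_feat, nir_feat)]


# Toy inputs (made-up numbers, not real features).
rgb_feat = [0.2, 0.9, 0.1]
nir_feat = [0.7, 0.1, 0.4]
gate_w = [[0.0] * 6, [0.0] * 6]  # untrained gate: no preference
gate_b = [0.0, 0.0]

weights = predict_fusion_weights(rgb_feat, nir_feat, gate_w, gate_b)
fused = fuse(rgb_feat, nir_feat, weights)
print(weights, fused)
```

With an all-zero gate the two branches contribute equally; a trained gate would instead shift weight toward whichever branch matches the modality of the current frame. In a real tracker the features would be convolutional feature maps and the gate would be learned end to end, which this sketch does not attempt to reproduce.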

Novel Pipeline Integrating Cross-Modality and Motion Model for Nearshore Multi-Object Tracking in Optical Video Surveillance

Online Multi-Object Tracking from A Bird's-Eye View by Fusion of Millimeter-Wave Radar and Vision

Cross-Modality 3D Multi-Object Tracking under Adverse Weather Via Adaptive Hard Sample Mining

Robust Multi-Modality Multi-Object Tracking

MAT: Motion-Aware Multi-Object Tracking

CAMO-MOT: Combined Appearance-Motion Optimization for 3D Multi-Object Tracking With Camera-LiDAR Fusion

MOMT: A Maritime Real-Time Visual Multi-Object Tracking Algorithm Based on Unmanned Surface Vehicles

Cross-Modal Object Tracking via Modality-Aware Fusion Network and a Large-Scale Dataset

An Integrated Detection and Multi-Object Tracking Pipeline for Satellite Video Analysis of Maritime and Aerial Objects

Spatial-Semantic and Temporal Attention Mechanism-Based Online Multi-Object Tracking

Cross-Modal Object Tracking: Modality-Aware Representations and a Unified Benchmark

Spatial-Attention Location-Aware Multi-Object Tracking

Know Your Surroundings: Panoramic Multi-Object Tracking by Multimodality Collaboration

Poly-MOT: A Polyhedral Framework for 3D Multi-Object Tracking

Multimodal Multiobject Tracking by Fusing Deep Appearance Features and Motion Information

MotionTrack: rethinking the motion cue for multiple object tracking in USV videos

Multi-Granularity Language-Guided Multi-Object Tracking

Text-Guided Multi-Class Multi-Object Tracking for Fine-Grained Maritime Rescue

A New Architecture for Neural Enhanced Multiobject Tracking

Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism

Awesome Multi-modal Object Tracking