Abstract: Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at

What problem does this paper attempt to address?

This paper addresses the insufficient performance of multimodal visual object tracking (VOT) under complex conditions. Specifically:

1. **Limitations of the RGB modality**: Traditional RGB-based visual object tracking degrades under complex conditions (such as extreme weather, poor imaging quality, and sensor failure), which matters most in safety-sensitive applications like autonomous driving. Researchers have therefore begun combining multimodal images to obtain more comprehensive information and improve tracking robustness and accuracy.
2. **Modality gap and RGB dominance**: Early research mainly focused on fully fine-tuning RGB-based trackers, which is inefficient and yields poorly generalized representations because multimodal data is scarce. Recent research uses prompt tuning to transfer pre-trained RGB models to multimodal data, but the modality gap limits the recall of pre-trained knowledge, and the RGB modality remains dominant, so information from the other modalities is not fully used.
3. **Limitations of existing methods**:
   - **Symmetric framework**: A fully fine-tuned symmetric framework has a symmetric information flow, but it introduces a large number of trainable parameters and is prone to overfitting, especially when multimodal data is scarce.
   - **Asymmetric framework**: An asymmetric framework transfers pre-trained RGB models to other modalities through prompt tuning. It is parameter-efficient, but it depends on the RGB modality and cannot fully use the information of other modalities, which leaves it insufficiently robust under extreme conditions.

To solve these problems, the paper proposes a new method named SDSTrack. Its main contributions are:

1. **Symmetric Multimodal Adaptation (SMA)**: Lightweight multimodal adaptation efficiently transfers the feature extraction ability of the pre-trained model from the RGB modality to other modalities and fuses multimodal features in a balanced, symmetric manner.
2. **Complementary Masked Patch Distillation (CMPD)**: Self-distillation learning enhances the robustness and accuracy of the model under extreme conditions.
3. **Experimental verification**: Extensive experiments show that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios (RGB + Depth, RGB + Thermal, and RGB + Event), with particularly strong results under extreme conditions.

In conclusion, SDSTrack aims to improve the robustness and accuracy of multimodal visual object tracking under complex conditions through parameter-efficient fine-tuning and symmetric multimodal fusion.
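To make the idea of complementary patch masking more concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary above only states that CMPD uses masked patches and self-distillation. The function names (`complementary_masks`, `apply_mask`), the `mask_ratio` parameter, the zero fill value, and the assumption that every patch is kept in exactly one of the two modalities are all illustrative choices made here.

```python
import random


def complementary_masks(num_patches, mask_ratio=0.5, seed=None):
    """Return (rgb_keep, x_keep), two boolean lists of length num_patches.

    Assumption for illustration: a patch masked in the RGB input is kept in
    the auxiliary (X) input and vice versa, so the two masks are exact
    complements of each other.
    """
    rng = random.Random(seed)
    num_masked_rgb = round(num_patches * mask_ratio)
    masked_in_rgb = set(rng.sample(range(num_patches), num_masked_rgb))
    rgb_keep = [i not in masked_in_rgb for i in range(num_patches)]
    x_keep = [not keep for keep in rgb_keep]
    return rgb_keep, x_keep


def apply_mask(patches, keep, fill=0.0):
    """Replace every patch whose keep flag is False with a constant-filled patch."""
    return [
        list(patch) if k else [fill] * len(patch)
        for patch, k in zip(patches, keep)
    ]


# Toy usage: 16 patches per modality, each patch a short feature vector.
rgb_patches = [[1.0, 2.0] for _ in range(16)]
x_patches = [[3.0, 4.0] for _ in range(16)]
rgb_keep, x_keep = complementary_masks(16, mask_ratio=0.5, seed=0)
masked_rgb = apply_mask(rgb_patches, rgb_keep)
masked_x = apply_mask(x_patches, x_keep)
```

In a self-distillation setup of this kind, one would presumably feed the clean pair (`rgb_patches`, `x_patches`) and the masked pair (`masked_rgb`, `masked_x`) through the same tracker and penalize the difference between the two outputs. That training loop is omitted here because the summary does not specify it.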

SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking

XTrack: Multimodal Training Boosts RGB-X Video Object Trackers

MATI: Multimodal Adaptive Tracking Integrator for Robust Visual Object Tracking

SSTtrack: A Unified Hyperspectral Video Tracking Framework via Modeling Spectral-Spatial-Temporal Conditions

Visual Object Tracking across Diverse Data Modalities: A Review

Robust Visual Tracking Via Multiple Discriminative Models with Object Proposals

Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Cross-Modal Object Tracking: Modality-Aware Representations and a Unified Benchmark

Single-Model and Any-Modality for Video Object Tracking

AMATrack: A Unified Network With Asymmetric Multimodal Mixed Attention for RGBD Tracking

EMTrack: Efficient Multimodal Object Tracking

Visual Object Tracking with Multi-Frame Distractor Suppression

SUTrack: Towards Simple and Unified Single Object Tracking

Robust RGB-T Tracking Via Adaptive Modality Weight Correlation Filters and Cross-modality Learning

Breaking Modality Gap in RGBT Tracking: Coupled Knowledge Distillation

Deep Spatial and Temporal Network for Robust Visual Object Tracking

Unsupervised Cross-Modal Distillation for Thermal Infrared Tracking

HSTrack: Bootstrap End-to-End Multi-Camera 3D Multi-object Tracking with Hybrid Supervision

Bi-directional Adapter for Multi-modal Tracking

Multi-Adapter RGBT Tracking

SwapTrack: Enhancing RGB-T Tracking Via Learning from Paired and Single-Modal Data