Abstract: Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based parameters. Albeit effective, this manner is not optimal due to the scarcity of downstream data and poor transferability, etc. In this paper, inspired by the recent success of the prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, meanwhile only introducing a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT can achieve state-of-the-art performance while satisfying parameter efficiency. Code and models are available at https://github.com/jiawen-zhu/ViPT.
What problem does this paper attempt to address?
### Problems Addressed by the Paper
This paper aims to address several key issues in multi-modal object tracking:
1. **Data Scarcity**: Multi-modal tracking tasks lack large-scale datasets, which limits the generalization ability and performance improvement of models. For example, commonly used multi-modal tracking datasets (such as DepthTrack, LasHeR, VisEvent) have an order of magnitude fewer training sequences compared to single-modal RGB tracking datasets (such as GOT-10k, TrackingNet, LaSOT).
2. **Limitations of Full Fine-Tuning**: Existing multi-modal tracking methods usually adapt pre-trained RGB models to downstream tasks through full fine-tuning, but this approach has the following problems:
   - **High Time Cost**: Full fine-tuning updates every parameter of the model, which costs considerable training time and compute.
   - **Large Parameter Storage Burden**: Each downstream task needs its own complete copy of the fine-tuned parameters, which complicates storage and deployment.
   - **Poor Generalization Ability**: Because downstream datasets are small, full fine-tuning is prone to overfitting and cannot fully exploit the knowledge in the large-scale pre-trained model.
3. **Utilization of Modal Complementarity**: The key to multi-modal tracking is how to effectively utilize the complementary information between different modalities to improve the robustness and accuracy of tracking. Existing methods usually add additional network branches to handle auxiliary modal inputs, but this increases the complexity and parameter count of the model.
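The storage burden can be made concrete with a little arithmetic. The sketch below compares the number of parameters that must be kept for several downstream tasks under full fine-tuning versus a shared frozen backbone with small per-task modules. The backbone size and per-task module size are hypothetical round numbers chosen for illustration, not figures from the paper.

```python
def stored_parameters(backbone_params, num_tasks, prompt_params_per_task=None):
    """Total parameters that must be stored to serve `num_tasks` tasks.

    prompt_params_per_task=None models full fine-tuning: every task keeps
    its own complete copy of the model. Otherwise one frozen backbone is
    shared and each task only adds its own small trainable module.
    """
    if prompt_params_per_task is None:
        return backbone_params * num_tasks
    return backbone_params + prompt_params_per_task * num_tasks


# Hypothetical sizes, for illustration only.
BACKBONE_PARAMS = 100_000_000   # assumed backbone size
PROMPT_PARAMS = 800_000         # assumed per-task module, under 1% of the backbone
NUM_TASKS = 3                   # RGB+Depth, RGB+Thermal, RGB+Event

full_ft = stored_parameters(BACKBONE_PARAMS, NUM_TASKS)
prompted = stored_parameters(BACKBONE_PARAMS, NUM_TASKS, PROMPT_PARAMS)

print(f"full fine-tuning: {full_ft:,} parameters")
print(f"frozen backbone + per-task modules: {prompted:,} parameters")
```

Under these assumed sizes, the shared-backbone setup grows only slightly with each extra task, while full fine-tuning grows by one whole model per task.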
### Solution
To address the above issues, the paper proposes the **Visual Prompt Multi-Modal Tracking (ViPT)** framework, with the following main features:
1. **Visual Prompt Learning**: ViPT adapts the frozen pre-trained base model to various downstream multi-modal tracking tasks by learning modality-related visual prompts. This method introduces only a small number of trainable parameters (less than 1% of the model parameters), achieving efficient and parameter-friendly model adaptation.
2. **Modality-Complementary Prompter (MCP)**: ViPT designs a modality-complementary prompter to generate effective visual prompts. The MCP feeds the auxiliary-modality input into the frozen base model through lightweight modules (such as 1×1 convolution layers) that learn the complementary relationships between modalities.
3. **General Framework**: ViPT is a general framework applicable to various downstream multi-modal tracking tasks, including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Experimental results show that ViPT achieves state-of-the-art performance in multiple downstream tasks while maintaining parameter efficiency.
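To illustrate how a lightweight prompter built from 1×1 convolutions might work, here is a minimal pure-Python sketch. It is not the paper's exact MCP design: the `Prompter` class, its project-sum-project layout, the additive `prompted_input` step, and all sizes are assumptions made for illustration. The only detail taken from the text above is that the trainable modules are small layers such as 1×1 convolutions, and that the backbone itself stays frozen.

```python
import random


def conv1x1(feature_map, weight):
    """1x1 convolution: apply the same channel-mixing matrix at every location.

    feature_map: list of per-location channel vectors, each of length C_in
    weight:      C_out x C_in matrix given as a list of rows
    """
    return [[sum(w * x for w, x in zip(row, pixel)) for row in weight]
            for pixel in feature_map]


def make_weight(c_out, c_in, rng):
    """Small random C_out x C_in matrix standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(c_in)] for _ in range(c_out)]


class Prompter:
    """Hypothetical prompter: project both modalities down, sum, project back up."""

    def __init__(self, channels, hidden, rng):
        self.rgb_in = make_weight(hidden, channels, rng)
        self.aux_in = make_weight(hidden, channels, rng)
        self.out = make_weight(channels, hidden, rng)

    def num_params(self):
        return sum(len(row)
                   for w in (self.rgb_in, self.aux_in, self.out)
                   for row in w)

    def __call__(self, rgb_feat, aux_feat):
        r = conv1x1(rgb_feat, self.rgb_in)
        a = conv1x1(aux_feat, self.aux_in)
        fused = [[x + y for x, y in zip(p, q)] for p, q in zip(r, a)]
        return conv1x1(fused, self.out)


def prompted_input(rgb_feat, prompt):
    """Add the prompt to the RGB features that the frozen backbone consumes."""
    return [[x + p for x, p in zip(pixel, pr)]
            for pixel, pr in zip(rgb_feat, prompt)]


rng = random.Random(0)
CHANNELS, HIDDEN, LOCATIONS = 8, 2, 4   # toy sizes, assumed
prompter = Prompter(CHANNELS, HIDDEN, rng)
rgb = [[rng.random() for _ in range(CHANNELS)] for _ in range(LOCATIONS)]
aux = [[rng.random() for _ in range(CHANNELS)] for _ in range(LOCATIONS)]

prompt = prompter(rgb, aux)
x = prompted_input(rgb, prompt)
print(len(x), len(x[0]), prompter.num_params())
```

Because the hidden width is much smaller than the channel count, the prompter holds far fewer weights than a full layer of the backbone would, which is the general idea behind keeping the trainable share of parameters small.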
### Experimental Validation
The paper conducts extensive experiments on multiple multi-modal tracking datasets to verify the effectiveness and generalization ability of ViPT:
- **DepthTrack**: ViPT significantly outperforms other methods in terms of precision and recall, achieving an F-score of 59.4%, which is 6.5 percentage points higher than the base model.
- **VOT-RGBD2022**: ViPT achieves an expected average overlap (EAO) of 0.721, which is 0.045 (4.5 points) higher than the base model.
- **RGBT234**: ViPT achieves 61.7% maximum success rate (MSR) and 83.5% maximum precision rate (MPR), surpassing other RGB-T trackers.
- **LasHeR**: ViPT exceeds the second-best tracker by 10.5% and 11.3% in success rate and precision, respectively, demonstrating its strong adaptability in complex scenarios.
- **VisEvent**: ViPT improves the success rate and precision by 5.8% and 6.3%, respectively, over the second-best tracker.
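For reference, the F-score reported on DepthTrack combines precision and recall as their harmonic mean, F = 2PR / (P + R). A minimal sketch follows; the input values are made-up illustrative numbers, not results from the paper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only, not values reported for any tracker.
example = f_score(0.6, 0.5)
print(f"{example:.4f}")
```

The harmonic mean is pulled towards the lower of the two values, so a tracker cannot reach a high F-score by trading away recall for precision or the reverse.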
In summary, ViPT addresses data scarcity and the limitations of full fine-tuning in multi-modal object tracking through visual prompt learning and the modality-complementary prompter, achieving strong tracking performance with few trainable parameters.