Breaking Modality Gap in RGBT Tracking: Coupled Knowledge Distillation

Andong Lu, Jiacong Zhao, Chenglong Li, Yun Xiao, Bin Luo
2024-10-15
Abstract: The modality gap between RGB and thermal infrared (TIR) images is a crucial but often overlooked issue in existing RGBT tracking methods. It can be observed that the modality gap mainly lies in the image style difference. In this work, we propose a novel Coupled Knowledge Distillation framework called CKD, which pursues common styles across modalities to break the modality gap for high-performance RGBT tracking. In particular, we introduce two student networks and employ a style distillation loss to make their style features as consistent as possible. By alleviating the style difference between the two student networks, we can effectively break the modality gap. However, distilling style features might harm the content representations of the two modalities in the student networks. To handle this issue, we take the original RGB and TIR networks as teachers and distill their content knowledge into the two student networks, respectively, through a style-content orthogonal feature decoupling scheme. We couple these two distillation processes in an online optimization framework to form new feature representations of the RGB and thermal modalities without a modality gap. In addition, we integrate a masked modeling strategy and a multi-modal candidate token elimination strategy into CKD to improve tracking robustness and efficiency, respectively. Extensive experiments on five standard RGBT tracking datasets validate the effectiveness of the proposed method against state-of-the-art methods while achieving the fastest tracking speed of 96.4 FPS. Code is available at https://github.com/Multi-Modality-Tracking/CKD.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the modality gap in RGBT (visible-light and thermal infrared) tracking. Because the two modalities are imaged in different spectral bands, RGB and thermal infrared images differ significantly in appearance style, which hurts both the performance and the efficiency of multi-modal tracking. Existing RGBT tracking methods often overlook this issue.

### Core contributions of the paper:

1. **Propose a novel Coupled Knowledge Distillation (CKD) framework**: Break the modality gap by eliminating the style differences between RGB and thermal infrared images to achieve high-performance RGBT tracking.
2. **Design a style-content coupled distillation scheme**: Based on the style-content orthogonal feature decoupling strategy, eliminate the modality gap while avoiding damage to the modality content representation.
3. **Introduce a masked modeling strategy**: Enhance the learning of modality content representations, especially in challenging scenarios.
4. **Design a multi-modal candidate token elimination strategy**: Improve the robustness and efficiency of tracking by considering the information of the two modalities jointly.

### Influence of modality gap and solutions:

- **Influence of modality style on modality gap**: After the style information is removed through instance normalization, the modality gap is significantly reduced, indicating that modality style is an important factor in the modality gap.
- **Style distillation**: To make the style features of the two modalities as consistent as possible, compute and minimize the mean squared error (MSE) of the style features between the two student branches:
  \[ L_{SD}=\frac{1}{L}\sum_{l = 1}^{L}\left((\mu_s^{(l)}-\mu_t^{(l)})^2+(\sigma_s^{(l)}-\sigma_t^{(l)})^2\right) \]
  where \(\mu\) and \(\sigma\) are the mean and standard deviation of the features, the subscripts \(s\) and \(t\) distinguish the two branches being compared, and \(L\) is the number of layers.
- **Content distillation**: To keep the modality content representation stable, apply the classical instance normalization operation to obtain the content features and measure the similarity of the content features between the teacher and student branches. For the thermal infrared modality, the content distillation loss \(L_{CD}^{tir}\) can be expressed as:
  \[ L_{CD}^{tir}=\frac{1}{L}\sum_{l = 1}^{L}\left(\hat{F}_l^{tir}-\hat{f}_l^{tir}\right)^2 \]
  where \(\hat{F}_l^{tir}\) and \(\hat{f}_l^{tir}\) are the instance-normalized content features of the two branches (teacher and student) at layer \(l\). The total content distillation loss is:
  \[ L_{CD}=L_{CD}^{tir}+L_{CD}^{rgb} \]

### Experimental results:

- **Performance improvement**: On four mainstream public datasets, the CKD method achieves state-of-the-art results, with PR/SR scores increased by 1.6%/2.7%, 1.6%/3.0%, 3.0%/2.0% and 10.1%/11.1%, respectively.
- **Speed improvement**: Compared with existing methods, the tracking speed of CKD increases by 60.2 FPS, reaching 96.4 FPS.

In conclusion, this paper addresses the modality gap problem in RGBT tracking by introducing the coupled knowledge distillation framework, significantly improving tracking performance and efficiency.
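
As an illustration of the style and content distillation losses described above, the following is a minimal, framework-free sketch and not the authors' implementation (their code is in the linked repository). All function names are hypothetical. Nested Python lists stand in for feature tensors (one list per channel), and two details are assumptions not stated in the summary: the losses are averaged over channels and positions within a layer, and a small `EPS` is added in the normalization.

```python
from statistics import fmean, pstdev

EPS = 1e-5  # assumed stabilizer for the division in instance normalization


def channel_stats(feat):
    """Per-channel mean and standard deviation (the 'style' statistics).

    `feat` is one layer's feature map: a list of channels, each a list of
    activations over spatial positions or tokens.
    """
    return [fmean(ch) for ch in feat], [pstdev(ch) for ch in feat]


def instance_norm(feat):
    """Remove channel mean/std so that only normalized 'content' remains."""
    mus, sigmas = channel_stats(feat)
    return [[(x - m) / (s + EPS) for x in ch]
            for ch, m, s in zip(feat, mus, sigmas)]


def style_distill_loss(feats_a, feats_b):
    """L_SD: squared gaps of means and stds, averaged over the L layers.

    Averaging over channels inside each layer is an assumption.
    """
    total = 0.0
    for fa, fb in zip(feats_a, feats_b):
        mu_a, sd_a = channel_stats(fa)
        mu_b, sd_b = channel_stats(fb)
        total += fmean([(ma - mb) ** 2 + (sa - sb) ** 2
                        for ma, mb, sa, sb in zip(mu_a, mu_b, sd_a, sd_b)])
    return total / len(feats_a)


def content_distill_loss(teacher_feats, student_feats):
    """L_CD for one modality: MSE between instance-normalized features."""
    total = 0.0
    for ft, fs in zip(teacher_feats, student_feats):
        ct, cs = instance_norm(ft), instance_norm(fs)
        total += fmean([(a - b) ** 2
                        for row_t, row_s in zip(ct, cs)
                        for a, b in zip(row_t, row_s)])
    return total / len(teacher_feats)


# Toy check: one layer, two channels; `g` is a rescaled, shifted copy of `f`.
f = [[1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 4.0, 6.0]]
g = [[2.0 * x + 3.0 for x in ch] for ch in f]
print(style_distill_loss([f], [g]))    # clearly positive: the statistics differ
print(content_distill_loss([f], [g]))  # close to zero: normalized content matches
```

The toy check mirrors the decoupling idea: changing only the per-channel scale and offset of a feature map changes its style statistics but leaves its instance-normalized content almost untouched, so the two losses can be optimized together without directly conflicting.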