SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection

Shuhan Dong, Yunsong Li, Weiying Xie, Jiaqing Zhang, Jiayuan Tian, Danian Yang, Jie Lei
2024-10-15
Abstract: Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors. By learning long-term dependencies, Transformer can effectively integrate multimodal features in the feature extraction stage, which greatly improves the performance of multimodal object detection. However, current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at various depth layers of network, thus limiting the improvements in detection performance. In this paper, we introduce an accurate and efficient object detection method named SeaDATE. Initially, we propose a novel dual attention Feature Fusion (DTF) module that, under Transformer's guidance, integrates local and global information through a dual attention mechanism, strengthening the fusion of modal features from orthogonal perspectives using spatial and channel tokens. Meanwhile, our theoretical analysis and empirical validation demonstrate that the Transformer-guided fusion method, treating images as sequences of pixels for fusion, performs better on shallow features' detail information compared to deep semantic information. To address this, we designed a contrastive learning (CL) module aimed at learning features of multimodal samples, remedying the shortcomings of Transformer-guided fusion in extracting deep semantic features, and effectively utilizing cross-modal information. Extensive experiments and ablation studies on the FLIR, LLVIP, and M3FD datasets have proven our method to be effective, achieving state-of-the-art detection performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses feature fusion in multimodal object detection. Current methods that fuse features with Transformers typically just stack Transformer-guided fusion blocks, without examining how well they extract features at different depths of the network, which limits the gains in detection performance. In addition, Transformer-guided fusion handles the detail information in shallow features well but is weaker at extracting deep semantic information.

To solve these problems, the authors propose SeaDATE, a multimodal object detection method based on a dual-attention mechanism and contrastive learning. Its main contributions are:

1. **Designed a dual-attention feature fusion module (DTF)**: By combining spatial attention and channel attention, it integrates local and global information and strengthens the fusion of multimodal features.
2. **Introduced a contrastive learning module (CL)**: Placed at the deepest layer of the network, it compensates for the Transformer's weakness in extracting deep semantic features and promotes effective use of cross-modal information.
3. **Improved object detection accuracy**: Experiments on the public FLIR, LLVIP, and M3FD datasets show that the method outperforms other leading techniques in detection performance.

### Method Overview

#### Dual-Attention Transformer Fusion Module (DTF)

- **Spatial Multi-Head Attention**: Captures complementary information between RGB and infrared images through pixel-level tokens, enhancing information interaction across spatial positions.
  \[ S = \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]
- **Channel-Group Attention**: Captures global information along the channel dimension. Each channel token spans the entire spatial extent of the feature map, so every spatial position is taken into account, which strengthens the exchange of global information.
  \[ C_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i^\top K_i}{C_g}\right)V_i^\top \]

#### Contrastive Learning Module (CL)

- **Dual-Branch Encoder and Projection Head**: Learns feature representations by pulling each query \(z_q\) toward its positive sample \(z_k\) and pushing it away from the negative samples \(z_i\), with \(\tau\) acting as a temperature.
  \[ L_c = -\log\frac{\exp(z_q \cdot z_k / \tau)}{\exp(z_q \cdot z_k / \tau) + \sum_{i = 1}^{n} \exp(z_q \cdot z_i / \tau)} \]
- **Queue Dictionary Storage**: A queue enlarges the dictionary of samples beyond what GPU memory and computing power would otherwise allow, which improves the generalization ability of the model.

#### Loss Function

- **Overall Loss**: A weighted sum of the detection loss \(L_o\) and the contrastive learning loss \(L_c\).
  \[ L = \alpha_1 L_o + \alpha_2 L_c \]
  where \(L_o\) combines the objectness loss \(L_{\text{obj}}\) (whether an object exists), the localization loss \(L_{\text{loc}}\), and the classification loss \(L_{\text{cls}}\).
  \[ L_o=\lambda_{\text{obj}}\sum_{h = 0} a_h L_{\text{obj}}+\lambda_{\text{loc}}\sum_{h = 0} b_h L_{\text{loc}}+\lambda