Abstract: Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors. By learning long-term dependencies, Transformer can effectively integrate multimodal features in the feature extraction stage, which greatly improves the performance of multimodal object detection. However, current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at various depth layers of network, thus limiting the improvements in detection performance. In this paper, we introduce an accurate and efficient object detection method named SeaDATE. Initially, we propose a novel dual attention Feature Fusion (DTF) module that, under Transformer's guidance, integrates local and global information through a dual attention mechanism, strengthening the fusion of modal features from orthogonal perspectives using spatial and channel tokens. Meanwhile, our theoretical analysis and empirical validation demonstrate that the Transformer-guided fusion method, treating images as sequences of pixels for fusion, performs better on shallow features' detail information compared to deep semantic information. To address this, we designed a contrastive learning (CL) module aimed at learning features of multimodal samples, remedying the shortcomings of Transformer-guided fusion in extracting deep semantic features, and effectively utilizing cross-modal information. Extensive experiments and ablation studies on the FLIR, LLVIP, and M3FD datasets have proven our method to be effective, achieving state-of-the-art detection performance.

What problem does this paper attempt to address?

The problem this paper addresses is feature fusion in multimodal object detection. Current methods that use Transformers for feature fusion usually just stack Transformer-guided fusion blocks without examining how well they extract features at different network depths, which limits the improvement in detection performance. In addition, existing methods extract the detail information of shallow features well but fall short on deep semantic information.

To solve these problems, the authors propose SeaDATE (Semantic Alignment via Contrast Learning for Multimodal Object Detection), a multimodal object detection method based on a dual-attention mechanism and contrastive learning. The main contributions are:

1. **Designed a dual-attention feature fusion module (DTF)**: By introducing spatial-attention and channel-attention mechanisms, it integrates local and global information and strengthens the fusion of multimodal features.
2. **Introduced a contrastive learning module (CL)**: This module sits at the deepest layer of the network. It compensates for the Transformer's weakness in extracting deep semantic features and promotes effective use of cross-modal information.
3. **Significantly improved object detection accuracy**: Experimental results on the public FLIR, LLVIP, and M3FD datasets show that the method outperforms other leading techniques in detection performance.

### Method Overview

#### Dual-Attention Transformer Fusion Module (DTF)

- **Spatial Multi-Head Attention**: Captures complementary information between RGB and infrared images through pixel-level tokens, enhancing the interaction between spatial positions.
  \[ S = \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]
- **Channel-Group Attention**: Captures global information along the channel dimension. Each channel token spans the whole spatial extent, so every spatial position is taken into account, which strengthens the exchange of global information.
  \[ C_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i^\top K_i}{C_g}\right)V_i^\top \]

#### Contrast Learning Module (CL)

- **Dual-Branch Encoder and Projection Head**: Learns feature representations by pulling positive pairs together and pushing negative samples apart.
  \[ L_c = -\log\frac{\exp(z_q \cdot z_k / \tau)}{\exp(z_q \cdot z_k / \tau) + \sum_{i = 1}^{n} \exp(z_q \cdot z_i / \tau)} \]
- **Queue Dictionary Storage**: Expands the dictionary size through a queue mechanism, which sidesteps the limits of GPU memory and computing power and improves the generalization ability of the model.

#### Loss Function

- **Overall Loss**: Consists of the detection loss \(L_o\) and the contrastive learning loss \(L_c\).
  \[ L = \alpha_1 L_o + \alpha_2 L_c \]
  where \(L_o\) combines the objectness loss \(L_{\text{obj}}\) (whether an object exists), the localization loss \(L_{\text{loc}}\), and the classification loss \(L_{\text{cls}}\).
  \[ L_o=\lambda_{\text{obj}}\sum_{h = 0} a_h L_{\text{obj}}+\lambda_{\text{loc}}\sum_{h = 0} b_h L_{\text{loc}}+\lambda
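
To make the two attention forms in the DTF module concrete, here is a minimal, dependency-free Python sketch of the spatial and channel formulas as written above. It is not the authors' implementation: it assumes a single head and a single channel group, uses the raw tokens directly as Q, K, and V (no learned projections), and all function names are hypothetical.

```python
import math

def softmax_rows(m):
    """Row-wise softmax of a matrix stored as a list of lists."""
    out = []
    for row in m:
        peak = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def scale(m, s):
    return [[v * s for v in row] for row in m]

def spatial_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, one token per spatial position.

    q, k, v have shape (N, d_k): N pixel tokens with d_k channels each.
    The attention map is (N, N), so it grows with the image size.
    """
    d_k = len(q[0])
    scores = scale(matmul(q, transpose(k)), 1.0 / math.sqrt(d_k))
    return matmul(softmax_rows(scores), v)  # shape (N, d_k)

def channel_attention(q, k, v, c_g):
    """softmax(Q^T K / C_g) V^T, one token per channel.

    q, k, v have shape (N, C_g). The attention map is (C_g, C_g), so each
    channel token aggregates over all N spatial positions at once.
    """
    scores = scale(matmul(transpose(q), k), 1.0 / c_g)
    return matmul(softmax_rows(scores), transpose(v))  # shape (C_g, N)

# Toy input (assumed): 4 spatial tokens with 2 channels each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
spatial_out = spatial_attention(tokens, tokens, tokens)        # 4 x 2
channel_out = channel_attention(tokens, tokens, tokens, c_g=2)  # 2 x 4
```

The shapes show the complementary roles: the spatial map relates every pixel token to every other pixel token, while the channel map stays at `C_g x C_g` regardless of how many spatial positions there are.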
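
The contrastive loss \(L_c\) and the queue dictionary described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's code: the temperature `tau=0.07`, the queue length, the loss weights, and all names are hypothetical, and the embeddings are assumed to be L2-normalized outputs of the projection heads.

```python
import math
from collections import deque

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(z_q, z_k, negatives, tau=0.07):
    """L_c = -log(exp(z_q.z_k/tau) / (exp(z_q.z_k/tau) + sum_i exp(z_q.z_i/tau))).

    z_q is the query embedding, z_k its positive key, and `negatives` an
    iterable of negative keys z_i (here drawn from the queue).
    """
    logits = [dot(z_q, z_k) / tau] + [dot(z_q, z_i) / tau for z_i in negatives]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

def total_loss(l_o, l_c, alpha_1=1.0, alpha_2=1.0):
    """L = alpha_1 * L_o + alpha_2 * L_c (the weights here are placeholders)."""
    return alpha_1 * l_o + alpha_2 * l_c

# Fixed-size dictionary of past keys: appending to a full deque silently
# drops the oldest entry, so the dictionary size is decoupled from batch size.
queue = deque(maxlen=2)
for key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    queue.append(normalize(key))

z_q = normalize([1.0, 0.0])
z_k = normalize([0.9, 0.1])  # positive key, close to the query
loss = info_nce(z_q, z_k, queue)
```

Because the queue holds keys from earlier batches, the number of negatives in the denominator can exceed what fits in a single batch, which is the motivation the summary gives for the queue mechanism.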

SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection
