Object Detection in Multispectral Remote Sensing Images Based on Cross-Modal Cross-Attention

Pujie Zhao, Xia Ye, Ziang Du
DOI: https://doi.org/10.3390/s24134098
IF: 3.9
2024-06-24
Sensors
Abstract: In complex environments, a single visible image is not sufficient for perceiving the surroundings. This paper proposes a novel dual-stream real-time detector for target detection in extreme environments such as nighttime and fog, which efficiently utilises both visible and infrared images to achieve Fast All-Weather environment sensing (FAWDet). First, to allow the network to process information from different modalities simultaneously, this paper extends the state-of-the-art end-to-end detector YOLOv8 by expanding its backbone in parallel into a dual stream. Then, to avoid information loss as the network deepens, a cross-modal feature enhancement module is designed, which enhances each modality's features through a cross-modal attention mechanism, effectively avoiding this loss and improving the detection of small targets. In addition, to address the significant differences between modal features, this paper proposes a three-stage fusion strategy that optimises feature integration by fusing along the spatial, channel and overall dimensions. It is worth mentioning that the cross-modal feature fusion module is trained end-to-end. Extensive experiments on two datasets validate that the proposed method achieves state-of-the-art performance in detecting small targets. The cross-modal real-time detector in this study not only demonstrates excellent stability and robust detection performance, but also provides a new solution for target detection in extreme environments.
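The core idea of the cross-modal feature enhancement module, where each modality's features are enhanced by attending to the other modality, can be illustrated with a minimal sketch. It is not the authors' implementation. In FAWDet the inputs would be feature maps from the parallel visible and infrared backbone branches; here they are assumed to be already flattened into lists of spatial tokens with equal channel counts. The learned query/key/value projections are omitted, and the names `cross_attention` and `cross_modal_enhance` are hypothetical.

```python
import math
from typing import List, Tuple

# A feature map flattened into tokens: one row per spatial position, one column per channel.
Matrix = List[List[float]]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax_rows(a: Matrix) -> Matrix:
    """Numerically stable softmax applied independently to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(query: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention: each query token attends over all context tokens.

    Simplification (assumption): learned Q/K/V projections are replaced by the
    identity, so the context tokens serve as both keys and values.
    """
    scale = 1.0 / math.sqrt(len(query[0]))
    scores = [[s * scale for s in row] for row in matmul(query, transpose(context))]
    return matmul(softmax_rows(scores), context)


def cross_modal_enhance(rgb: Matrix, ir: Matrix) -> Tuple[Matrix, Matrix]:
    """Hypothetical enhancement step: each modality queries the other one and
    adds the attended result back through a residual connection."""
    rgb_from_ir = cross_attention(rgb, ir)
    ir_from_rgb = cross_attention(ir, rgb)
    rgb_out = [[a + b for a, b in zip(r, e)] for r, e in zip(rgb, rgb_from_ir)]
    ir_out = [[a + b for a, b in zip(r, e)] for r, e in zip(ir, ir_from_rgb)]
    return rgb_out, ir_out


if __name__ == "__main__":
    # Toy features from the two streams: 3 spatial tokens x 2 channels each.
    rgb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    ir = [[0.0, 2.0], [2.0, 0.0], [1.0, 1.0]]
    rgb_enh, ir_enh = cross_modal_enhance(rgb, ir)
    print(len(rgb_enh), len(rgb_enh[0]))  # 3 2
```

The residual connection keeps each modality's original features intact while adding what it retrieves from the other stream. Pure Python keeps the sketch dependency-free; a real detector would run the same computation on framework tensors with learned projections and, typically, multiple attention heads.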
Engineering, Electrical & Electronic; Instruments & Instrumentation; Chemistry, Analytical
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the problem that a single visible-light image is insufficient for effective environmental perception in complex conditions such as nighttime and fog. Specifically, the study proposes a novel dual-stream real-time detector that uses both visible-light and infrared images to achieve all-weather environmental perception, with a particular focus on small object detection in extreme environments.

### Main Contributions

1. **Dual-Stream Real-Time Detector**: A dual-stream real-time detector, built by expanding the YOLOv8 backbone into two parallel streams, performs stable object detection in extreme environments (such as nighttime and fog) using both visible-light and infrared images.
2. **Cross-Modal Attention Mechanism**: A cross-modal feature enhancement module filters and enhances the features of each modality through a cross-modal attention mechanism, reducing information loss as the network deepens and improving the detection of small, weak targets.
3. **Three-Stage Fusion Strategy**: To address the significant differences between features from different modalities, a three-stage fusion strategy optimizes feature fusion across the spatial, channel, and overall dimensions. Notably, the cross-modal feature fusion module is trained end-to-end.
4. **Experimental Validation**: Extensive experiments on two datasets show that the method achieves state-of-the-art (SOTA) performance in remote sensing object detection, particularly for small targets. The cross-modal real-time detector also demonstrates excellent stability and robustness, and it offers a new solution for object detection in extreme environments.
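The summary does not specify the layers inside the three-stage fusion, so the sketch below is only a hypothetical illustration of fusing two modalities along the spatial, channel and overall dimensions. Stage 1 uses one gate per location, stage 2 uses one gate per channel computed from global average pooling (in the spirit of squeeze-and-excitation, but with no learned weights), and stage 3 blends the two results with an assumed global weight `alpha`. All function names and `alpha` are illustrative assumptions, not the paper's design.

```python
import math
from typing import List

# Feature map flattened into tokens: rows = spatial positions, columns = channels.
Matrix = List[List[float]]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


def spatial_fusion(rgb: Matrix, ir: Matrix) -> Matrix:
    """Stage 1 (spatial): one gate per location, favouring the modality whose
    mean activation across channels is stronger at that position."""
    fused = []
    for r, i in zip(rgb, ir):
        w = _sigmoid(_mean(r) - _mean(i))
        fused.append([w * a + (1.0 - w) * b for a, b in zip(r, i)])
    return fused


def channel_fusion(rgb: Matrix, ir: Matrix) -> Matrix:
    """Stage 2 (channel): one gate per channel from global average pooling
    over all locations (no learned layers in this sketch)."""
    gates = [_sigmoid(_mean(rc) - _mean(ic)) for rc, ic in zip(zip(*rgb), zip(*ir))]
    return [[g * a + (1.0 - g) * b for g, a, b in zip(gates, r, i)]
            for r, i in zip(rgb, ir)]


def three_stage_fusion(rgb: Matrix, ir: Matrix, alpha: float = 0.5) -> Matrix:
    """Stage 3 (overall): blend the spatially and channel-fused maps with a
    global weight `alpha` (hypothetical; a trained model would learn it)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    spatial = spatial_fusion(rgb, ir)
    channel = channel_fusion(rgb, ir)
    return [[alpha * s + (1.0 - alpha) * c for s, c in zip(rs, rc)]
            for rs, rc in zip(spatial, channel)]


if __name__ == "__main__":
    rgb = [[0.9, 0.1], [0.2, 0.0]]  # visible stream: strong at location 0
    ir = [[0.1, 0.3], [0.8, 0.6]]   # infrared stream: strong at location 1
    print(three_stage_fusion(rgb, ir))
```

Gating with a sigmoid of the difference keeps every fused value a convex combination of the two inputs, so neither modality is discarded outright. In a trained network these fixed rules would typically be replaced by learned convolutions or small MLPs.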