Abstract: Integrating multispectral data in object detection, especially visible and infrared images, has received great attention in recent years. Since visible (RGB) and infrared (IR) images can provide complementary information to handle light variations, the paired images are used in many fields, such as multispectral pedestrian detection, RGB-IR crowd counting and RGB-IR salient object detection. Compared with natural RGB-IR images, we find that detection in aerial RGB-IR images suffers from cross-modal weakly misalignment problems, which are manifested in the position, size and angle deviations of the same object. In this paper, we mainly address the challenge of cross-modal weakly misalignment in aerial RGB-IR images. Specifically, we first explain and analyze the cause of the weakly misalignment problem. Then, we propose a Translation-Scale-Rotation Alignment (TSRA) module to address the problem by calibrating the feature maps from these two modalities. The module predicts the deviation between two modality objects through an alignment process and utilizes a Modality-Selection (MS) strategy to improve the performance of alignment. Finally, a two-stream feature alignment detector (TSFADet) based on the TSRA module is constructed for RGB-IR object detection in aerial images. With comprehensive experiments on the public DroneVehicle dataset, we verify that our method reduces the effect of the cross-modal misalignment and achieves robust detection results.
What problem does this paper attempt to address?
This paper attempts to address the issue of cross-modal weakly misalignment in aerial RGB-IR images. Specifically, the same object can differ in position, size, and angle between the visible (RGB) and infrared (IR) images, which makes detection harder. The paper notes that paired RGB-IR images are already widely used on natural images, in fields such as multispectral pedestrian detection, RGB-IR crowd counting, and RGB-IR salient object detection. In aerial images, however, the weakly misalignment problem is more prominent: objects can be oriented in any direction, and the deviations in position, size, and angle are coupled, which makes alignment more complex.
To solve this problem, the authors propose a Translation-Scale-Rotation Alignment (TSRA) module to reduce these deviations by calibrating the feature maps of the two modalities. Additionally, a Modality-Selection (MS) strategy is introduced to select the appropriate annotation as the reference bounding box. Finally, based on the TSRA module, a Two-Stream Feature Alignment Detector (TSFADet) is constructed for object detection in aerial RGB-IR images.
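To make the three kinds of deviation concrete, the sketch below expresses them on oriented boxes of the form (center x, center y, width, height, angle). It is only an illustration under stated assumptions: the names `OrientedBox`, `Deviation`, `deviation_between`, and `calibrate` are hypothetical, the log-scale encoding of the size deviation is an assumed choice, and the code works on box parameters rather than on feature maps as the TSRA module does.

```python
import math
from dataclasses import dataclass


@dataclass
class OrientedBox:
    """Hypothetical oriented box: center, size, and angle in radians."""
    cx: float
    cy: float
    w: float
    h: float
    theta: float


@dataclass
class Deviation:
    """Hypothetical translation, scale, and rotation offsets between two boxes."""
    dx: float
    dy: float
    dw: float      # log ratio of widths (assumed encoding)
    dh: float      # log ratio of heights (assumed encoding)
    dtheta: float  # angle difference in radians


def deviation_between(src: OrientedBox, ref: OrientedBox) -> Deviation:
    """Measure how far `src` is from the reference box `ref`."""
    return Deviation(
        dx=ref.cx - src.cx,
        dy=ref.cy - src.cy,
        dw=math.log(ref.w / src.w),
        dh=math.log(ref.h / src.h),
        dtheta=ref.theta - src.theta,
    )


def calibrate(box: OrientedBox, dev: Deviation) -> OrientedBox:
    """Apply translation, then scale, then rotation offsets to `box`."""
    return OrientedBox(
        cx=box.cx + dev.dx,
        cy=box.cy + dev.dy,
        w=box.w * math.exp(dev.dw),
        h=box.h * math.exp(dev.dh),
        theta=box.theta + dev.dtheta,
    )


# Example: the same vehicle annotated slightly differently in the two modalities.
ir_box = OrientedBox(cx=100.0, cy=80.0, w=40.0, h=20.0, theta=0.30)
rgb_box = OrientedBox(cx=104.0, cy=78.0, w=44.0, h=19.0, theta=0.35)
dev = deviation_between(ir_box, rgb_box)
aligned = calibrate(ir_box, dev)
```

In this toy setting, applying the measured deviation to the IR box recovers the RGB box up to floating-point error. Which modality serves as the reference is the decision the MS strategy is described as making.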
### Main Contributions:
1. **The cross-modal weakly misalignment problem in rotated object detection on aerial RGB-IR images is identified and analyzed for the first time**.
2. **The TSRA module is proposed**, which combines the alignment process with the MS strategy to calibrate the feature maps of objects in the two modalities by translation, scale, and rotation.
3. **Multi-task Jitter is designed**, further improving the robustness of the model.
4. **TSFADet is constructed** and validated on the DroneVehicle dataset, where it is shown to mitigate the cross-modal weakly misalignment problem.
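The summary does not describe how Multi-task Jitter works. One plausible reading, offered here purely as an assumption, is that boxes are randomly perturbed in translation, scale, and rotation during training so that the alignment process sees a wider range of deviations. The function `jitter_box` and its `max_shift`, `max_scale`, and `max_angle` parameters are hypothetical and are not taken from the paper.

```python
import random


def jitter_box(box, rng, max_shift=0.1, max_scale=0.1, max_angle=0.1):
    """Randomly perturb an oriented box (cx, cy, w, h, theta).

    Assumed behavior, not the paper's definition:
    - the center shifts by up to `max_shift` of the box width/height,
    - width and height are rescaled by up to +/- `max_scale`,
    - the angle changes by up to `max_angle` radians.
    """
    cx, cy, w, h, theta = box
    dx = rng.uniform(-max_shift, max_shift) * w
    dy = rng.uniform(-max_shift, max_shift) * h
    scale_w = 1.0 + rng.uniform(-max_scale, max_scale)
    scale_h = 1.0 + rng.uniform(-max_scale, max_scale)
    dtheta = rng.uniform(-max_angle, max_angle)
    return (cx + dx, cy + dy, w * scale_w, h * scale_h, theta + dtheta)


# A seeded generator keeps the perturbation reproducible.
rng = random.Random(0)
original = (50.0, 50.0, 20.0, 10.0, 0.3)
jittered = jitter_box(original, rng)
```

Each of the three perturbations is bounded, so a jittered box stays close to its source while still varying in position, size, and angle.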
### Experimental Results:
- Experiments on the DroneVehicle dataset show that TSFADet outperforms existing single-modal and multi-modal detectors on multiple metrics.
- Ablation experiments demonstrate the effectiveness of the TSRA module, MS strategy, and Multi-task Jitter.
In summary, this paper proposes a feature-level alignment approach to the cross-modal weakly misalignment problem in aerial RGB-IR images and shows experimentally that it improves object detection performance.