Abstract: Unmanned aerial vehicle (UAV)-based bi-modal salient object detection (BSOD) aims to segment salient objects in a scene utilizing complementary cues in unaligned RGB and thermal image pairs. However, the high computational expense of existing UAV-based BSOD models limits their applicability to real-world UAV devices. To address this problem, we propose an efficient Fourier filter network with contrastive learning that achieves both real-time and accurate performance. Specifically, we first design a semantic contrastive alignment loss to align the two modalities at the semantic level, which facilitates mutual refinement in a parameter-free way. Second, inspired by the fast Fourier transform that obtains global relevance in linear complexity, we propose synchronized alignment fusion, which aligns and fuses bi-modal features in the channel and spatial dimensions by a hierarchical filtering mechanism. Our proposed model, AlignSal, reduces the number of parameters by 70.0%, decreases the floating point operations by 49.4%, and increases the inference speed by 152.5% compared to the cutting-edge BSOD model (i.e., MROS). Extensive experiments on the UAV RGB-T 2400 and three weakly aligned datasets demonstrate that AlignSal achieves both real-time inference speed and better performance and generalizability compared to sixteen state-of-the-art BSOD models across most evaluation metrics. In addition, our ablation studies further verify AlignSal's potential in boosting the performance of existing aligned BSOD models on UAV-based unaligned data. The code is available at: https://github.com/JoshuaLPF/AlignSal
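To illustrate why filtering in the Fourier domain gives global relevance, the following is a minimal NumPy sketch and not the AlignSal implementation. It multiplies a feature map's 2D spectrum elementwise by a frequency-domain filter, so every output position can depend on every input position. The function names `global_filter` and `shift_filter` are hypothetical, and the filter here is hand-built from the Fourier shift theorem rather than learned; in a trained network it would presumably be a learnable complex tensor. Note that the FFT itself costs O(N log N) for N spatial positions.

```python
import numpy as np

def global_filter(feature, freq_filter):
    """Filter a (C, H, W) feature map by elementwise multiplication of its 2D spectrum."""
    spectrum = np.fft.fft2(feature, axes=(-2, -1))
    filtered = spectrum * freq_filter  # (H, W) filter broadcasts over channels
    return np.fft.ifft2(filtered, axes=(-2, -1)).real

def shift_filter(height, width, dy, dx):
    """Frequency response of a circular shift by (dy, dx) pixels (Fourier shift theorem)."""
    fy = np.fft.fftfreq(height)[:, None]  # vertical frequencies in cycles per pixel
    fx = np.fft.fftfreq(width)[None, :]   # horizontal frequencies in cycles per pixel
    return np.exp(-2j * np.pi * (fy * dy + fx * dx))

# Hypothetical thermal feature map: 4 channels on a 16 x 16 grid.
rng = np.random.default_rng(0)
thermal = rng.standard_normal((4, 16, 16))

# One frequency-domain multiplication realizes a shift of 5 rows and -3 columns,
# a displacement larger than the reach of a single 3 x 3 convolution.
shifted = global_filter(thermal, shift_filter(16, 16, dy=5, dx=-3))
matches_roll = np.allclose(shifted, np.roll(thermal, shift=(5, -3), axis=(-2, -1)))
print(matches_roll)
```

The check against `np.roll` shows that a single global filter can express an arbitrarily large spatial displacement, which is the property that makes this family of operations attractive for unaligned image pairs.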

What problem does this paper attempt to address?

The paper addresses three key issues in unmanned aerial vehicle (UAV)-based unaligned bi-modal salient object detection (BSOD):

1. **High computational cost**: Existing UAV-based BSOD models are computationally expensive, which makes real-time processing difficult in practice. The modality alignment and fusion modules in particular consume substantial computational resources and limit the overall inference speed of the model.
2. **Spatial shift of small-scale objects**: Because UAVs operate at high altitudes, the captured objects are small, and their positions can shift significantly between modalities. Existing alignment strategies, such as convolutional attention operations, struggle with these large spatial shifts because their receptive fields are limited.
3. **Handling of unaligned data**: Most existing bi-modal datasets are manually aligned, which sidesteps many challenges of real-world scenarios. Models designed on these aligned datasets therefore often perform poorly when applied to unaligned data.

To address these issues, the authors propose AlignSal, an efficient real-time model with two main components:

- **Semantic Contrastive Alignment Loss (SCAL)**: Aligns the RGB and thermal modalities at the semantic level through contrastive learning. SCAL pulls similar local features closer and pushes dissimilar features apart in the embedding space, which improves alignment without adding any computational burden at inference time.
- **Synchronized Alignment Fusion (SAF) module**: Uses the fast Fourier transform (FFT) to align and fuse bi-modal features in both the channel and spatial dimensions. SAF captures spatial shifts and bi-modal salient cues hierarchically through multiple sets of global filters and integrates them into the final features, keeping complexity low enough for real-time performance.
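The pull-together, push-apart idea behind a contrastive alignment loss can be sketched as follows. This is a generic symmetric InfoNCE loss written in NumPy, not the paper's SCAL formulation; the function name `info_nce`, the temperature of 0.1, and the random embeddings are all assumptions made for illustration. Row i of each embedding matrix is treated as a matching RGB and thermal pair, and all other rows act as negatives.

```python
import numpy as np

def info_nce(rgb_emb, thermal_emb, temperature=0.1):
    """Symmetric InfoNCE over N paired embeddings of shape (N, D)."""
    # L2-normalize so that dot products are cosine similarities.
    rgb = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    thermal = thermal_emb / np.linalg.norm(thermal_emb, axis=1, keepdims=True)
    logits = rgb @ thermal.T / temperature  # (N, N); the diagonal holds the positives

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row, then pick the matching pair.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the RGB-to-thermal and thermal-to-RGB directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Hypothetical embeddings: 8 local features with 32 dimensions each.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 32))
aligned_thermal = rgb + 0.05 * rng.standard_normal((8, 32))  # nearly matching pairs
unrelated_thermal = rng.standard_normal((8, 32))             # no correspondence

loss_aligned = info_nce(rgb, aligned_thermal)
loss_unrelated = info_nce(rgb, unrelated_thermal)
print(loss_aligned < loss_unrelated)
```

Because a loss of this kind is only evaluated during training, it adds no parameters or operations to the deployed model, which is consistent with the claim that alignment improves without extra inference cost.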
Through these innovations, AlignSal outperforms 16 state-of-the-art BSOD models on most evaluation metrics while increasing inference speed by 152.5% over the cutting-edge model MROS, with 70.0% fewer parameters and 49.4% fewer floating-point operations.
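Percent changes like these are easy to misread, so the short sketch below converts them into multipliers of the baseline. It assumes each percentage is measured relative to MROS, as the abstract's wording suggests; the helper name `relative_to_baseline` is hypothetical.

```python
def relative_to_baseline(percent_change):
    """Convert a signed percent change into a multiplier of the baseline value."""
    return 1.0 + percent_change / 100.0

params_ratio = relative_to_baseline(-70.0)  # parameters relative to MROS
flops_ratio = relative_to_baseline(-49.4)   # floating-point operations relative to MROS
speed_ratio = relative_to_baseline(152.5)   # inference speed relative to MROS

print(f"parameters: {params_ratio:.3f}x, FLOPs: {flops_ratio:.3f}x, speed: {speed_ratio:.3f}x")
```

Under that reading, AlignSal has roughly 0.3 times the parameters and about half the floating-point operations of MROS, and runs at roughly 2.5 times its speed.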

Efficient Fourier Filtering Network with Contrastive Learning for UAV-based Unaligned Bi-modal Salient Object Detection

Salient Object Detection with Bayesian Inference Based on Radar and Camera Fusion Used in UAV Obstacle Avoidance

Cross-Modal Oriented Object Detection of UAV Aerial Images Based on Image Feature

Learnable Cross-Scale Sparse Attention Guided Feature Fusion for UAV Object Detection

Learning Adaptive Fusion Bank for Multi-modal Salient Object Detection

ALSS-YOLO: An Adaptive Lightweight Channel Split and Shuffling Network for TIR Wildlife Detection in UAV Imagery

BISSIAM: Bispectrum Siamese Network Based Contrastive Learning for UAV Anomaly Detection

SLBAF-Net: Super-Lightweight bimodal adaptive fusion network for UAV detection in low recognition environment

DREB-Net: Dual-stream Restoration Embedding Blur-feature Fusion Network for High-mobility UAV Object Detection

Alleviating Spatial Misalignment and Motion Interference for UAV-based Video Recognition

Deformable Convolution-Guided Multiscale Feature Learning and Fusion for UAV Object Detection

Unveiling the Limits of Alignment: Multi-modal Dynamic Local Fusion Network and A Benchmark for Unaligned RGBT Video Object Detection

Localization, balance and affinity: a stronger multifaceted collaborative salient object detector in remote sensing images

Shared-Weight-Based Multi-Dimensional Feature Alignment Network for Oriented Object Detection in Remote Sensing Imagery

Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection

Deep Fourier-embedded Network for Bi-modal Salient Object Detection

ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection

Adaptive Feature Fusion and Improved Attention Mechanism-Based Small Object Detection for UAV Target Tracking

Multi-Branch Parallel Networks for Object Detection in High-Resolution UAV Remote Sensing Images