Efficient Fourier Filtering Network with Contrastive Learning for UAV-based Unaligned Bi-modal Salient Object Detection

Pengfei Lyu, Pak-Hei Yeung, Xiufei Cheng, Xiaosheng Yu, Chengdong Wu, Jagath C. Rajapakse
2024-11-06
Abstract: Unmanned aerial vehicle (UAV)-based bi-modal salient object detection (BSOD) aims to segment salient objects in a scene utilizing complementary cues in unaligned RGB and thermal image pairs. However, the high computational expense of existing UAV-based BSOD models limits their applicability to real-world UAV devices. To address this problem, we propose an efficient Fourier filter network with contrastive learning that achieves both real-time and accurate performance. Specifically, we first design a semantic contrastive alignment loss to align the two modalities at the semantic level, which facilitates mutual refinement in a parameter-free way. Second, inspired by the fast Fourier transform that obtains global relevance in linear complexity, we propose synchronized alignment fusion, which aligns and fuses bi-modal features in the channel and spatial dimensions by a hierarchical filtering mechanism. Our proposed model, AlignSal, reduces the number of parameters by 70.0%, decreases the floating point operations by 49.4%, and increases the inference speed by 152.5% compared to the cutting-edge BSOD model (i.e., MROS). Extensive experiments on the UAV RGB-T 2400 and three weakly aligned datasets demonstrate that AlignSal achieves both real-time inference speed and better performance and generalizability compared to sixteen state-of-the-art BSOD models across most evaluation metrics. In addition, our ablation studies further verify AlignSal's potential in boosting the performance of existing aligned BSOD models on UAV-based unaligned data. The code is available at: https://github.com/JoshuaLPF/AlignSal.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses three key issues in unmanned aerial vehicle (UAV)-based unaligned bi-modal salient object detection (BSOD):

1. **High computational cost**: Existing UAV-based BSOD models are computationally expensive, which makes real-time processing difficult in practice. The modality alignment and fusion modules in particular consume substantial resources and limit the overall inference speed.
2. **Spatial shift of small-scale objects**: Because UAVs operate at high altitudes, the captured objects are small, and their positions can shift significantly between the two modalities. Existing alignment strategies, such as convolutional attention operations, have limited receptive fields and struggle to handle these large spatial shifts.
3. **Handling of unaligned data**: Most existing bi-modal datasets are manually aligned, which sidesteps many challenges found in real-world scenarios. Models designed around these aligned datasets therefore often perform poorly on unaligned data.

To address these issues, the authors propose an efficient, real-time model called AlignSal, which has the following features:

- **Semantic Contrastive Alignment Loss (SCAL)**: Aligns the RGB and thermal modalities at the semantic level through contrastive learning. SCAL pulls similar local features closer together and pushes dissimilar features apart in the embedding space, improving alignment without adding any computational cost at inference time.
- **Synchronized Alignment Fusion (SAF) module**: Uses the fast Fourier transform (FFT) to align and fuse bi-modal features in both the channel and spatial dimensions. SAF captures spatial shifts and bi-modal salient cues hierarchically through multiple sets of global filters and integrates them into the final features, keeping complexity low enough for real-time inference.
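The pull-together, push-apart behaviour attributed to SCAL can be sketched with a generic symmetric InfoNCE-style loss. This is a minimal illustration, not the formulation used in AlignSal: the function name `info_nce_alignment_loss`, the temperature value, and the assumption that row `i` of both inputs describes the same local region are all hypothetical.

```python
import numpy as np

def info_nce_alignment_loss(rgb_feats, thermal_feats, temperature=0.1):
    """Symmetric InfoNCE-style loss between paired RGB and thermal embeddings.

    Both inputs have shape (N, D). Row i of each array is ASSUMED to describe
    the same local region; this pairing is hypothetical, for illustration only.
    """
    # L2-normalise so that dot products become cosine similarities.
    rgb = rgb_feats / np.linalg.norm(rgb_feats, axis=1, keepdims=True)
    thr = thermal_feats / np.linalg.norm(thermal_feats, axis=1, keepdims=True)

    # Pairwise similarity matrix; entry (i, j) compares RGB i with thermal j.
    logits = rgb @ thr.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # Matching pairs sit on the diagonal: pull them together,
        # while the softmax denominator pushes non-matching pairs apart.
        return -np.mean(np.diag(log_prob))

    # Average the RGB-to-thermal and thermal-to-RGB directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy data: thermal features are noisy copies of the RGB features.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(8, 16))
thermal_aligned = rgb + 0.05 * rng.normal(size=(8, 16))
thermal_misaligned = np.roll(thermal_aligned, 1, axis=0)  # break the pairing

loss_aligned = info_nce_alignment_loss(rgb, thermal_aligned)
loss_misaligned = info_nce_alignment_loss(rgb, thermal_misaligned)
```

On this toy data the correctly paired features give a much lower loss than the deliberately mispaired ones. A loss of this kind is evaluated only during training, so it adds no parameters or operations at inference time.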
Through these innovations, AlignSal not only outperforms 16 existing state-of-the-art BSOD models on multiple evaluation metrics but also achieves faster inference speed (152.5% faster than the current top model MROS), reduces the number of parameters by 70.0%, and decreases floating-point operations by 49.4%.
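The global filtering idea behind SAF can be illustrated with a toy NumPy example. An element-wise multiplication in the frequency domain affects every spatial location at once, so a single filter can undo a spatial shift of any size, which a small convolution kernel cannot. The sketch below is not the SAF module: the hand-built phase-ramp filter stands in for filters that a network would learn, the shift is assumed to be circular and known in advance, and the names `global_filter` and `shift_filter` are hypothetical.

```python
import numpy as np

def global_filter(feature, freq_filter):
    """Filter a real feature map of shape (C, H, W) in the 2D frequency domain.

    freq_filter is a complex array of shape (C, H, W // 2 + 1).
    """
    spectrum = np.fft.rfft2(feature, axes=(-2, -1))
    filtered = spectrum * freq_filter  # element-wise, global in space
    return np.fft.irfft2(filtered, s=feature.shape[-2:], axes=(-2, -1))

def shift_filter(channels, height, width, dy, dx):
    """Phase-ramp filter that circularly shifts a map by (dy, dx) pixels.

    Follows the Fourier shift theorem; a learned filter would replace this
    hand-built one in a real model (assumption for illustration).
    """
    fy = np.fft.fftfreq(height)[:, None]   # row frequencies, cycles per pixel
    fx = np.fft.rfftfreq(width)[None, :]   # non-negative column frequencies
    ramp = np.exp(-2j * np.pi * (fy * dy + fx * dx))
    return np.broadcast_to(ramp, (channels, height, width // 2 + 1))

# Toy "RGB" features and a "thermal" copy displaced by a large offset.
rng = np.random.default_rng(0)
rgb_map = rng.normal(size=(4, 32, 32))
thermal_map = np.roll(rgb_map, shift=(5, -7), axis=(-2, -1))

# One frequency-domain multiplication undoes the whole displacement.
undo = shift_filter(4, 32, 32, dy=-5, dx=7)
realigned = global_filter(thermal_map, undo)
max_error = np.abs(realigned - rgb_map).max()
```

Here `max_error` is at the level of floating-point round-off, showing that the displaced map is recovered regardless of how large the shift is relative to a typical convolution kernel.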