Abstract: The rapid development of deep learning has significantly improved salient object detection that combines RGB and thermal images. However, existing deep learning-based models suffer from two major shortcomings. First, the computation and memory demands of Transformer-based models with quadratic complexity are unbearable, especially in handling high-resolution bi-modal feature fusion. Second, even if learning converges to an ideal solution, there remains a frequency gap between the prediction and ground truth. Therefore, we propose a purely fast Fourier transform-based model, namely deep Fourier-embedded network (DFENet), for learning bi-modal information of RGB and thermal images. On one hand, fast Fourier transform efficiently fetches global dependencies with low complexity. Inspired by this, we design modal-coordinated perception attention to bridge the frequency gap between RGB and thermal modalities with multi-dimensional representation enhancement. To obtain reliable detailed information during decoding, we design the frequency-decomposed edge-aware module (FEM) to clarify object edges by deeply decomposing low-level features. Moreover, we equip each decoder layer with the proposed Fourier residual channel attention block to prioritize high-frequency information while aligning channel global relationships. On the other hand, we propose co-focus frequency loss (CFL) to steer FEM towards minimizing the frequency gap. CFL dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing the bi-modal edge information in the Fourier domain. This frequency-level refinement of edge features further contributes to the quality of the final pixel-level prediction. Extensive experiments on four bi-modal salient object detection benchmark datasets demonstrate that the proposed DFENet outperforms twelve existing state-of-the-art models.
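The abstract's claim that the fast Fourier transform captures global dependencies at low cost can be illustrated with a small NumPy sketch. This is not DFENet's layer: `global_filter` and `circular_conv_direct` are hypothetical helper names, and the random `weight` merely stands in for whatever parameters a network would learn. The sketch only shows that an elementwise product in the Fourier domain equals a circular convolution whose kernel spans the whole image, so every output pixel depends on every input pixel, while the FFT route costs O(N log N) instead of the O(N^2) of the direct sum.

```python
import numpy as np

def global_filter(x, weight):
    """Multiply the 2-D spectrum of x by weight, then transform back."""
    h, w = x.shape
    spectrum = np.fft.rfft2(x)                      # shape (h, w // 2 + 1), complex
    return np.fft.irfft2(spectrum * weight, s=(h, w))

def circular_conv_direct(x, kernel):
    """Reference O(N^2) circular convolution: each input pixel touches all outputs."""
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out += x[i, j] * np.roll(kernel, shift=(i, j), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
h, w = 8, 8
x = rng.standard_normal((h, w))
# Assumption: a random complex filter as a placeholder for learned weights.
weight = rng.standard_normal((h, w // 2 + 1)) + 1j * rng.standard_normal((h, w // 2 + 1))

# The spatial kernel implied by the filter is its response to a unit impulse.
impulse = np.zeros((h, w))
impulse[0, 0] = 1.0
kernel = global_filter(impulse, weight)

y_fft = global_filter(x, weight)
y_direct = circular_conv_direct(x, kernel)
print(np.allclose(y_fft, y_direct))
```

The implied kernel has the same size as the input, which is what gives the operation an image-wide receptive field in a single step.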

What problem does this paper attempt to address?

This paper addresses two main problems in current deep learning models for bi-modal salient object detection (BSOD):

1. **Excessive computational and memory requirements**: Because of their quadratic complexity, Transformer-based models become prohibitively expensive in computation and memory when fusing high-resolution bi-modal features. This limits their efficiency and scalability in practical applications.
2. **Frequency gap**: Even if the model converges to an ideal solution, a frequency gap remains between the prediction and the ground truth, especially during edge feature reconstruction, where high-frequency information is easily ignored or lost.

To solve these problems, the authors propose a network built on the fast Fourier transform (FFT), the Deep Fourier-embedded Network (DFENet). Its main contributions are:

- **Global relationship alignment**: Using the efficient global representation ability of the FFT, DFENet aligns global relationships at each stage while keeping memory consumption and computational complexity low.
- **Modal-coordinated Perception Attention (MPA)**: Through a re-embedding strategy, MPA models the complementary information between the RGB and thermal modalities in the spatial and channel dimensions, yielding more accurate cross-modal fusion features.
- **Frequency-decomposed Edge-aware Module (FEM)**: FEM separates object edge features from cluttered backgrounds and extracts edge frequencies through multi-step decomposition of low-level features to guide the fusion of multi-level features.
- **Fourier Residual Channel Attention Block (FRCAB)**: FRCAB is placed in each decoder layer to emphasize high-frequency information and global dependencies in the channel dimension.
- **Co-focus Frequency Loss (CFL)**: By cross-referencing the edge information of the RGB and thermal modalities, CFL dynamically weights hard frequencies and guides FEM to optimize edge feature reconstruction in the frequency domain, which further improves the quality of the final pixel-level prediction.

With these components, DFENet outperforms twelve existing state-of-the-art models on four bi-modal salient object detection benchmark datasets.
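The idea of dynamically weighting hard frequencies can be sketched with a generic focal-frequency-style loss. This is an illustrative stand-in, not the paper's CFL: the bi-modal cross-referencing step is not reproduced, and the function name `weighted_frequency_loss` and the exponent `alpha` are assumptions introduced here. Each frequency's squared spectral error is scaled by a weight that grows with that error, so poorly reconstructed frequencies dominate the loss.

```python
import numpy as np

def weighted_frequency_loss(pred, target, alpha=1.0):
    """Mean squared spectral error, re-weighted toward hard frequencies."""
    f_pred = np.fft.fft2(pred, norm="ortho")
    f_tgt = np.fft.fft2(target, norm="ortho")
    dist = np.abs(f_pred - f_tgt) ** 2          # per-frequency squared error
    # Weight grows with the error magnitude; in a training framework this
    # term would typically be treated as a constant (no gradient through it).
    weight = np.sqrt(dist) ** alpha
    peak = weight.max()
    if peak > 0:
        weight = weight / peak                  # normalize weights to [0, 1]
    return float(np.mean(weight * dist))

rng = np.random.default_rng(0)
edge_pred = rng.random((16, 16))                # hypothetical predicted edge map
edge_gt = rng.random((16, 16))                  # hypothetical ground-truth edge map

print(weighted_frequency_loss(edge_pred, edge_gt))
print(weighted_frequency_loss(edge_gt, edge_gt))
```

With `alpha=0` every weight equals one and, because of the orthonormal FFT scaling, the loss reduces to the ordinary pixel-space mean squared error (Parseval's theorem); larger `alpha` shifts the emphasis toward the frequencies with the largest error.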

Deep Fourier-embedded Network for Bi-modal Salient Object Detection
