Deep Fourier-embedded Network for Bi-modal Salient Object Detection

Pengfei Lyu, Xiaosheng Yu, Chengdong Wu, Jagath C. Rajapakse
2024-11-27
Abstract: The rapid development of deep learning provides a significant improvement of salient object detection combining both RGB and thermal images. However, existing deep learning-based models suffer from two major shortcomings. First, the computation and memory demands of Transformer-based models with quadratic complexity are unbearable, especially in handling high-resolution bi-modal feature fusion. Second, even if learning converges to an ideal solution, there remains a frequency gap between the prediction and ground truth. Therefore, we propose a purely fast Fourier transform-based model, namely deep Fourier-embedded network (DFENet), for learning bi-modal information of RGB and thermal images. On one hand, fast Fourier transform efficiently fetches global dependencies with low complexity. Inspired by this, we design modal-coordinated perception attention to fuse the frequency gap between RGB and thermal modalities with multi-dimensional representation enhancement. To obtain reliable detailed information during decoding, we design the frequency-decomposed edge-aware module (FEM) to clarify object edges by deeply decomposing low-level features. Moreover, we equip proposed Fourier residual channel attention block in each decoder layer to prioritize high-frequency information while aligning channel global relationships. On the other hand, we propose co-focus frequency loss (CFL) to steer FEM towards minimizing the frequency gap. CFL dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing the bi-modal edge information in the Fourier domain. This frequency-level refinement of edge features further contributes to the quality of the final pixel-level prediction. Extensive experiments on four bi-modal salient object detection benchmark datasets demonstrate our proposed DFENet outperforms twelve existing state-of-the-art models.
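As a rough illustration of why an FFT can capture global dependencies cheaply, the sketch below multiplies the 2-D spectrum of a feature map by a filter. This is equivalent to a circular convolution whose kernel spans the whole map, so every output position depends on every input position at O(HW log HW) cost. It is not DFENet's attention module: the function name `fft_global_mix`, the random filter, and the 8x8 map size are assumptions made only for this example.

```python
import numpy as np

def fft_global_mix(feature, freq_filter):
    """Mix all spatial positions of one (H, W) feature map through the 2-D FFT."""
    spectrum = np.fft.fft2(feature)        # spatial domain -> frequency domain
    mixed = spectrum * freq_filter         # element-wise product, O(HW)
    return np.fft.ifft2(mixed).real        # back to the spatial domain

rng = np.random.default_rng(0)
H, W = 8, 8                                # illustrative size (assumption)
feature = rng.standard_normal((H, W))

# Hypothetical filter: the spectrum of a real spatial kernel, so the
# mixed output stays real-valued. A trained model would learn this instead.
kernel = rng.standard_normal((H, W))
freq_filter = np.fft.fft2(kernel)

out = fft_global_mix(feature, freq_filter)

# Perturb a single input pixel and measure how the output changes.
perturbed = feature.copy()
perturbed[0, 0] += 1.0
delta = fft_global_mix(perturbed, freq_filter) - out

# Count output positions affected by the single-pixel change.
print("output positions affected:", int(np.count_nonzero(np.abs(delta) > 1e-12)))
```

The change in the output equals the spatial kernel itself, which shows that a single frequency-domain product has a receptive field covering the entire map, without forming the pairwise attention matrix that makes self-attention quadratic.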
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve two main problems in current deep learning models for bi-modal salient object detection (BSOD):

1. **Excessively high computational and memory requirements**: Because of their quadratic complexity, Transformer-based models have unbearable computational and memory requirements when fusing high-resolution bi-modal features. This limits the efficiency and scalability of the models in practical applications.
2. **Frequency gap problem**: Even if the model converges to the ideal solution, a frequency gap remains between the prediction and the ground truth, especially during edge feature reconstruction, where high-frequency information is easily ignored or lost.

To solve these problems, the authors propose a new network based on the fast Fourier transform (FFT): the Deep Fourier-embedded Network (DFENet). The main contributions of DFENet include:

- **Global relationship alignment**: Using the efficient global representation ability of the FFT, DFENet aligns global relationships at each stage while keeping memory consumption and computational complexity low.
- **Modal-coordinated Perception Attention (MPA)**: Through a re-embedding strategy, MPA models the complementary information between the RGB and thermal modalities in the spatial and channel dimensions, yielding more accurate cross-modal fusion features.
- **Frequency-decomposed Edge-aware Module (FEM)**: FEM separates object edge features from cluttered backgrounds and extracts edge frequencies through multi-step decomposition of low-level features, which then guide the fusion of multi-level features.
- **Fourier Residual Channel Attention Block (FRCAB)**: FRCAB is placed in each decoder layer to emphasize high-frequency information and global dependencies in the channel dimension.
- **Co-focus Frequency Loss (CFL)**: By cross-referencing the edge information of the RGB and thermal modalities in the Fourier domain, CFL dynamically weights hard frequencies and steers FEM toward better edge reconstruction in the frequency domain, which further improves the quality of the final pixel-level prediction.

With these components, DFENet outperforms twelve existing state-of-the-art models on four bi-modal salient object detection benchmark datasets.
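The idea of weighting hard frequencies can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' CFL formulation, which this summary does not give. The function name `weighted_frequency_loss`, the way the two modalities' errors are combined into one weight map, and the `alpha` exponent are all hypothetical choices.

```python
import numpy as np

def weighted_frequency_loss(pred_rgb, pred_thermal, target, alpha=1.0, eps=1e-8):
    """Frequency-domain loss that up-weights poorly reconstructed frequencies.

    Assumption: both modality branches predict an (H, W) edge map and share
    one ground-truth edge map. The weighting rule is illustrative only.
    """
    # Orthonormal FFT so spectral error energy matches spatial error energy.
    f_rgb = np.fft.fft2(pred_rgb, norm="ortho")
    f_thermal = np.fft.fft2(pred_thermal, norm="ortho")
    f_target = np.fft.fft2(target, norm="ortho")

    # Per-frequency squared error of each modality.
    err_rgb = np.abs(f_rgb - f_target) ** 2
    err_thermal = np.abs(f_thermal - f_target) ** 2

    # Cross-referenced weight map: a frequency counts as "hard" when either
    # modality reconstructs it badly. Normalized to the range [0, 1].
    weight = (np.sqrt(err_rgb) + np.sqrt(err_thermal)) ** alpha
    weight = weight / (weight.max() + eps)

    return float(np.mean(weight * (err_rgb + err_thermal)))

rng = np.random.default_rng(0)
target = (rng.random((16, 16)) > 0.8).astype(float)       # toy edge map
noisy_rgb = target + 0.1 * rng.standard_normal((16, 16))
noisy_thermal = target + 0.3 * rng.standard_normal((16, 16))

loss_perfect = weighted_frequency_loss(target, target, target)
loss_noisy = weighted_frequency_loss(noisy_rgb, noisy_thermal, target)
print(loss_perfect, loss_noisy)
```

Because the weights are derived from the current errors, frequencies that are already well reconstructed contribute little, and the loss concentrates on the frequencies where the predicted edge spectrum still differs from the ground truth.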