Cross-stage Feature Fusion and Efficient Self-Attention for Salient Object Detection

Xiaofeng Xia, Yingdong Ma
DOI: https://doi.org/10.1016/j.jvcir.2024.104271
IF: 2.887
2024-01-01
Journal of Visual Communication and Image Representation
Abstract: Salient Object Detection (SOD) approaches usually aggregate high-level semantics with object details layer by layer through a pyramid fusion structure. However, this progressive feature fusion mechanism may gradually dilute valuable semantics and reduce prediction accuracy. In this work, we propose a Cross-stage Feature Fusion Network (CFFNet) for salient object detection. CFFNet consists of a Cross-stage Semantic Fusion Module (CSF), a Feature Filtering and Fusion Module (FFM), and a progressive decoder to tackle these problems. Specifically, to alleviate the semantic dilution problem, CSF concatenates backbone features from different stages and extracts multi-scale global semantics using transformer blocks. The global semantics are then distributed to the corresponding backbone stages for cross-stage semantic fusion. The FFM implements efficient self-attention-based feature fusion, in contrast to regular self-attention, which has quadratic computational complexity. Finally, a progressive decoder is adopted to refine the saliency maps. Experimental results demonstrate that CFFNet outperforms state-of-the-art methods on six SOD datasets.
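The quadratic cost of regular self-attention mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's FFM or any part of CFFNet; it is plain single-head self-attention with identity projections, and the helper names `self_attention` and `make_tokens`, the token dimension, and the feature-map sizes are all assumptions chosen for illustration. It counts the pairwise attention scores computed for a feature map flattened into tokens.

```python
import math


def self_attention(tokens):
    """Regular single-head self-attention with identity projections
    (Q = K = V = tokens). Returns the outputs and the number of
    pairwise scores computed. Illustrative only, not the paper's FFM."""
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    score_count = 0
    for q in tokens:
        # One score per (query, key) pair: N scores for each of N queries.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        score_count += len(scores)
        # Numerically stable softmax over the scores of this query.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        )
    return outputs, score_count


def make_tokens(height, width, dim=4):
    """Hypothetical feature map flattened into height * width tokens."""
    return [
        [((i * 31 + j * 17) % 7) / 7.0 for j in range(dim)]
        for i in range(height * width)
    ]


out_small, small = self_attention(make_tokens(4, 4))  # 16 tokens
out_large, large = self_attention(make_tokens(8, 8))  # 64 tokens
print(small, large, large // small)  # prints "256 4096 16"
```

Doubling the spatial resolution quadruples the token count and multiplies the number of attention scores by 16, which is why efficient attention variants are attractive for fusing high-resolution features.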
What problem does this paper attempt to address?