Abstract: Salient object detection requires a comprehensive and scalable receptive field to locate the visually significant objects in the image. Recently, the emergence of visual transformers and multi-branch modules has significantly enhanced the ability of neural networks to perceive objects at different scales. However, compared to the traditional backbone, the calculation process of transformers is time-consuming. Moreover, different branches of the multi-branch modules could cause the same error back propagation in each training iteration, which is not conducive to extracting discriminative features. To solve these problems, we propose a bilateral network based on transformer and CNN to efficiently broaden local details and global semantic information simultaneously. Besides, a Multi-Head Boosting (MHB) strategy is proposed to enhance the specificity of different network branches. By calculating the errors of different prediction heads, each branch can separately pay more attention to the pixels that other branches predict incorrectly. Moreover, unlike multi-path parallel training, MHB randomly selects one branch each time for gradient back propagation in a boosting way. Additionally, an Attention Feature Fusion Module (AF) is proposed to fuse two types of features according to respective characteristics. Comprehensive experiments on five benchmark datasets demonstrate that the proposed method can achieve a significant performance improvement compared with the state-of-the-art methods.

What problem does this paper attempt to address?

This paper addresses how to expand the receptive field of neural networks for Salient Object Detection (SOD) while preserving computational efficiency and accuracy. Specifically, the authors point out the following challenges:

1. **Expansion of the receptive field**: Traditional Convolutional Neural Networks (CNNs) perform poorly on large-scale images or tasks requiring global information because their receptive fields are local. Visual transformers provide a broader global perspective, but their computational cost is high, especially at large input resolutions.
2. **Synchronous error back-propagation in multi-branch modules**: Current multi-branch modules back-propagate the same error through all branches during training. The branches may therefore extract highly similar features that lack diversity, which hurts overall performance.
3. **Complementarity between CNNs and transformers**: CNNs and transformers differ in feature representation, computational complexity, and attention mechanisms. How to exploit the strengths of both so that they complement each other remains an open question.

To address these problems, the authors propose a Bilateral Network based on CNNs and transformers, together with a Multi-Head Boosting (MHB) strategy. The specific methods are as follows:

- **Bilateral Network**: Combines a lightweight CNN with a transformer that takes low-resolution input, so that local details and global semantic information are both extracted efficiently. The lightweight CNN quickly extracts detail at high resolution, while the transformer generates globally relevant features from the low-resolution input.
- **Attention Feature Fusion Module (AF)**: Fuses the CNN and transformer features through a cross-attention compensation mechanism, enhancing the respective advantages of each.
- **Multi-Head Boosting (MHB)**: Randomly selects one branch for gradient back-propagation and weights each branch according to the prediction errors of the other branches, which improves the complementarity between branches and the overall performance of the model.

Experimental results on multiple benchmark datasets show that the proposed method achieves a significant performance improvement in salient object detection, with a particularly clear improvement in the MAE (mean absolute error) metric over existing methods.
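The Multi-Head Boosting idea summarized above (pick one head at random, and emphasize the pixels that the other heads get wrong) can be illustrated with a small sketch. This is not the authors' implementation: the summary does not give the exact weighting formula, so the weight `1 + mean absolute error of the other heads`, the per-pixel binary cross-entropy loss, and the names `bce`, `boosting_weights`, and `mhb_step` are all assumptions made for illustration. Saliency maps are represented as flat lists of probabilities to keep the example dependency-free.

```python
import math
import random


def bce(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy for a predicted probability."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def boosting_weights(head_preds, target, selected):
    """Assumed weighting: 1 + mean absolute error of the OTHER heads, per pixel."""
    others = [p for i, p in enumerate(head_preds) if i != selected]
    if not others:
        raise ValueError("at least two prediction heads are required")
    weights = []
    for px, t in enumerate(target):
        err = sum(abs(p[px] - t) for p in others) / len(others)
        weights.append(1.0 + err)
    return weights


def mhb_step(head_preds, target, rng=random):
    """Pick one head at random and return (head index, its weighted mean loss).

    In a real training loop only this head's loss would be back-propagated
    in the current iteration; here we just compute the scalar.
    """
    selected = rng.randrange(len(head_preds))
    weights = boosting_weights(head_preds, target, selected)
    losses = [w * bce(p, t)
              for w, p, t in zip(weights, head_preds[selected], target)]
    return selected, sum(losses) / len(losses)


# Toy example: 3 heads, 4 pixels (hypothetical values).
target = [1.0, 0.0, 1.0, 0.0]
head_preds = [
    [0.9, 0.1, 0.2, 0.1],
    [0.8, 0.2, 0.9, 0.7],
    [0.9, 0.1, 0.8, 0.2],
]
selected, loss = mhb_step(head_preds, target, random.Random(0))
weights_for_head0 = boosting_weights(head_preds, target, selected=0)
```

In this toy input, the last pixel is the one the other two heads predict worst, so it receives the largest weight when head 0 is the selected branch. Under the stated assumptions, this is how a branch would "pay more attention" to pixels that its siblings misclassify.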

Receptive Field Broadening and Boosting for Salient Object Detection

Dual-Branch Feature Fusion Network for Salient Object Detection

Boosting Broader Receptive Fields for Salient Object Detection.

Salient object detection with dual-branch stepwise feature fusion and edge refinement

Dual-path Multi-Branch Feature Residual Network for Salient Object Detection

Bi-attention Network for Bi-Directional Salient Object Detection

Multi-attention Guided Feature Fusion Network for Salient Object Detection

AWANet: Attentive-Aware Wide-Kernels Asymmetrical Network with Blended Contour Information for Salient Object Detection

Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net

Attention-based Bi-Directional Refinement Network for Salient Object Detection

Bidirectional Mutual Guidance Transformer for Salient Object Detection in Optical Remote Sensing Images

Multi-level Features Selection Network Based on Multi-attention for Salient Object Detection.

Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection

Salient Object Detection Via Multi-Scale Neural Network.

SSTNet: Saliency sparse transformers network with tokenized dilation for salient object detection

RGB-D Salient Object Detection Method Based on Multi-Modal Fusion and Contour Guidance

Salient Object Detection Based on Backbone Enhanced Network

Unifying Global-Local Representations in Salient Object Detection with Transformers

Salient Object Detection Via Multiple Instance Joint Re-Learning

Feature Refinement from Multiple Perspectives for High Performance Salient Object Detection.

Multi-Modal Salient Feature Enhance for Rgb-T Salient Object Detection