Receptive Field Broadening and Boosting for Salient Object Detection

Mingcan Ma, Changqun Xia, Chenxi Xie, Xiaowu Chen, Jia Li
DOI: https://doi.org/10.48550/arXiv.2110.07859
2021-10-15
Abstract: Salient object detection requires a comprehensive and scalable receptive field to locate the visually significant objects in the image. Recently, the emergence of visual transformers and multi-branch modules has significantly enhanced the ability of neural networks to perceive objects at different scales. However, compared to the traditional backbone, the calculation process of transformers is time-consuming. Moreover, different branches of the multi-branch modules could cause the same error back propagation in each training iteration, which is not conducive to extracting discriminative features. To solve these problems, we propose a bilateral network based on transformer and CNN to efficiently broaden local details and global semantic information simultaneously. Besides, a Multi-Head Boosting (MHB) strategy is proposed to enhance the specificity of different network branches. By calculating the errors of different prediction heads, each branch can separately pay more attention to the pixels that other branches predict incorrectly. Moreover, unlike multi-path parallel training, MHB randomly selects one branch each time for gradient back propagation in a boosting way. Additionally, an Attention Feature Fusion Module (AF) is proposed to fuse two types of features according to respective characteristics. Comprehensive experiments on five benchmark datasets demonstrate that the proposed method can achieve a significant performance improvement compared with the state-of-the-art methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to effectively expand the receptive field of neural networks in the Salient Object Detection (SOD) task while maintaining the computational efficiency and performance of the model. Specifically, the authors point out the following challenges:

1. **Expansion of the receptive field**: Traditional Convolutional Neural Networks (CNNs) perform poorly on large-scale images or tasks requiring global information because their receptive fields are local. Visual transformers provide a broader global perspective, but their computational cost is high, especially when the input image resolution is large.
2. **Synchronous error back-propagation of multi-branch modules**: Current multi-branch modules back-propagate the same error through all branches in every training iteration. This can make the features extracted by different branches highly similar and lacking in diversity, which hurts the overall performance of the model.
3. **Complementarity between CNNs and transformers**: CNNs and transformers differ in feature representation, computational complexity, and attention mechanisms. How to exploit the advantages of both so that they complement each other is an open question.

To solve these problems, the authors propose a Bilateral Network based on CNNs and transformers, as well as a Multi-Head Boosting (MHB) strategy. The specific methods are as follows:

- **Bilateral Network**: Combines a lightweight CNN with a transformer that takes low-resolution input, so that local details and global semantic information are extracted efficiently. The lightweight CNN quickly extracts detailed information at high resolution, while the transformer generates globally relevant features from the low-resolution input.
- **Attention Feature Fusion Module (AF)**: Fuses the CNN and transformer features through a cross-attention compensation mechanism, so that each type of feature reinforces the strengths of the other.
- **Multi-Head Boosting (MHB)**: Randomly selects one branch per iteration for gradient back-propagation and weights that branch according to the prediction errors of the other branches, so that it pays more attention to the pixels the other branches predict incorrectly. This improves the complementarity between branches and the overall performance of the model.

Experimental results on five benchmark datasets show that the proposed method achieves a significant performance improvement in salient object detection compared with existing methods, with a particularly clear improvement in MAE (mean absolute error).
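The Multi-Head Boosting idea summarized above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the weighting rule `1 + alpha * mean error of the other heads`, and the use of a weighted binary cross-entropy loss are all assumptions made for illustration, and saliency maps are flattened to plain Python lists so the example runs without any deep learning framework.

```python
import math
import random


def pixel_errors(pred, target):
    """Absolute per-pixel error between a predicted saliency map and the ground truth."""
    return [abs(p - t) for p, t in zip(pred, target)]


def boosting_weights(preds, target, selected, alpha=1.0):
    """Per-pixel weights for the selected head.

    Assumed rule (not taken from the paper): 1 + alpha * mean error of the
    other heads, so pixels the other heads get wrong receive larger weights.
    """
    others = [pixel_errors(p, target) for i, p in enumerate(preds) if i != selected]
    if not others:
        raise ValueError("at least two prediction heads are required")
    return [
        1.0 + alpha * sum(err[j] for err in others) / len(others)
        for j in range(len(target))
    ]


def weighted_bce(pred, target, weights, eps=1e-7):
    """Weighted binary cross-entropy, normalized by the sum of the weights."""
    total = 0.0
    for p, t, w in zip(pred, target, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / sum(weights)


def mhb_step(preds, target, rng=random):
    """Pick one head at random and compute its boosted loss.

    In a real training loop only this head's loss would be back-propagated
    in the current iteration; the gradient step itself is omitted here.
    """
    selected = rng.randrange(len(preds))
    weights = boosting_weights(preds, target, selected)
    loss = weighted_bce(preds[selected], target, weights)
    return selected, weights, loss


if __name__ == "__main__":
    # Toy data: three heads, four pixels, values are made up for illustration.
    target = [1.0, 0.0, 1.0, 0.0]
    preds = [
        [0.9, 0.1, 0.2, 0.1],
        [0.8, 0.2, 0.9, 0.3],
        [0.7, 0.1, 0.1, 0.6],
    ]
    head, weights, loss = mhb_step(preds, target, random.Random(0))
    print("selected head:", head)
    print("pixel weights:", weights)
    print("boosted loss:", loss)
```

In the toy data, the third pixel is mispredicted by the first and last heads, so when the middle head is selected that pixel receives the largest weight. This is the sense in which each branch is pushed toward the pixels the other branches get wrong.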