PGNeXt: High-Resolution Salient Object Detection via Pyramid Grafting Network

Changqun Xia, Chenxi Xie, Zhentao He, Tianshu Yu, Jia Li
2024-08-02
Abstract: We present an advanced study on more challenging high-resolution salient object detection (HRSOD) from both dataset and network framework perspectives. To compensate for the lack of HRSOD dataset, we thoughtfully collect a large-scale high resolution salient object detection dataset, called UHRSD, containing 5,920 images from real-world complex scenarios at 4K-8K resolutions. All the images are finely annotated in pixel-level, far exceeding previous low-resolution SOD datasets. Aiming at overcoming the contradiction between the sampling depth and the receptive field size in the past methods, we propose a novel one-stage framework for HR-SOD task using pyramid grafting mechanism. In general, transformer-based and CNN-based backbones are adopted to extract features from different resolution images independently and then these features are grafted from transformer branch to CNN branch. An attention-based Cross-Model Grafting Module (CMGM) is proposed to enable CNN branch to combine broken detailed information more holistically, guided by different source feature during decoding process. Moreover, we design an Attention Guided Loss (AGL) to explicitly supervise the attention matrix generated by CMGM to help the network better interact with the attention from different branches. Comprehensive experiments on UHRSD and widely-used SOD datasets demonstrate that our method can simultaneously locate salient object and preserve rich details, outperforming state-of-the-art methods. To verify the generalization ability of the proposed framework, we apply it to the camouflaged object detection (COD) task. Notably, our method performs superior to most state-of-the-art COD methods without bells and whistles.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenges of high-resolution salient object detection (HRSOD). Specifically, the authors focus on the following issues:

1. **Lack of high-resolution datasets**:
   - Most existing salient object detection (SOD) datasets are low resolution (less than 512×512 pixels), so models trained on them cannot capture sufficient detail when given high-resolution inputs.
   - To fill this gap, the authors constructed UHRSD, a large-scale high-resolution salient object detection dataset containing 5,920 images at 4K-8K resolution from complex real-world scenes, all finely annotated at the pixel level.
2. **Contradiction between sampling depth and receptive field size**:
   - In past SOD methods, the sampling depth of the network conflicts with the size of its receptive field. A traditional FPN (Feature Pyramid Network) can only extract features within a limited range, which makes it difficult to capture global semantics and rich details at the same time.
   - To resolve this, the authors propose a new one-stage framework built on a pyramid grafting mechanism. The framework combines the strengths of Transformers and CNNs: it extracts features independently from images of different resolutions, then grafts these features from the Transformer branch to the CNN branch.
3. **Computational burden of high-resolution inputs**:
   - Processing high-resolution images is computationally expensive. Feeding them directly into existing SOD models slows inference, and details lost along the way are difficult to recover.
   - The authors designed an asymmetric feature extraction strategy: a lightweight CNN captures the spatial features of the large input, while a Transformer captures the contextual features of a regular-sized input. This reduces the computational burden and makes the two branches complementary.
4. **Cross-model feature grafting and attention-guided loss**:
   - To graft heterogeneous features more effectively, the authors propose an attention-based Cross-Model Grafting Module (CMGM) and guide the grafting process with an Attention Guided Loss (AGL), which helps the network interact with features from the different branches.

In summary, the paper aims to overcome the difficulties existing SOD methods face with high-resolution inputs by constructing a high-quality high-resolution dataset and proposing a new network architecture, thereby achieving more accurate and efficient salient object detection.
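The attention-based grafting idea can be illustrated with a small NumPy sketch. This is not the authors' CMGM implementation. It is a generic cross-attention step under stated assumptions: queries come from the CNN branch, keys and values come from the Transformer branch, and the result is added back to the CNN features through a residual connection. The function name `graft`, the feature shapes, and the random projection matrices are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def graft(cnn_feat, trans_feat, w_q, w_k, w_v):
    """Cross-attention from CNN-branch tokens to Transformer-branch tokens.

    cnn_feat:   (C, Hc, Wc) feature map from the CNN branch (assumed shape)
    trans_feat: (C, Ht, Wt) feature map from the Transformer branch
    w_q, w_k, w_v: (C, C) projections (random here; learned in a real model)
    """
    c, hc, wc = cnn_feat.shape
    # Flatten spatial dimensions so each location becomes one token: (N, C).
    q_tokens = cnn_feat.reshape(c, -1).T
    kv_tokens = trans_feat.reshape(c, -1).T

    q = q_tokens @ w_q
    k = kv_tokens @ w_k
    v = kv_tokens @ w_v

    # Each CNN location attends over all Transformer locations.
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)  # (Nc, Nt)
    grafted = attn @ v                             # (Nc, C)

    # Residual connection keeps the CNN branch's own detail features.
    out = (q_tokens + grafted).T.reshape(c, hc, wc)
    return out, attn


rng = np.random.default_rng(0)
C = 8
cnn_feat = rng.standard_normal((C, 16, 16))   # higher-resolution branch
trans_feat = rng.standard_normal((C, 8, 8))   # lower-resolution branch
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

out, attn = graft(cnn_feat, trans_feat, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (8, 16, 16) (256, 64)
```

The returned `attn` array is the kind of attention matrix that a loss such as AGL could supervise. How the paper builds the supervision target is not described in this summary, so it is left out of the sketch.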