Abstract: Salient object detection (SOD) is an important preprocessing operation for various computer vision tasks. Most existing RGB-D SOD models employ addition or concatenation strategies to directly aggregate and decode multi-scale features when predicting saliency maps. However, because features at different scales differ considerably, these aggregation strategies may lead to information loss or redundancy. Moreover, few methods explicitly consider how to establish connections between features at different scales during decoding, which degrades detection performance. To this end, we propose a Cascaded and Aggregated Transformer Network (CATNet), which consists of three key modules: an attention feature enhancement module (AFEM), a cross-modal fusion module (CMFM), and a cascaded correction decoder (CCD). Specifically, the AFEM builds on atrous spatial pyramid pooling and enhances high-level features by extracting multi-scale semantic information and global context information from them through dilated convolution and a multi-head self-attention mechanism. The CMFM enhances and then fuses the RGB and depth features, alleviating the problem of poor-quality depth maps. The CCD is composed of two cascaded subdecoders and is designed to suppress noise in low-level features and mitigate the differences between features at different scales. In addition, the CCD uses a feedback mechanism that exploits supervised features to correct and repair the output of the subdecoder, mitigating the information loss caused by upsampling during multi-scale feature aggregation. Extensive experimental results demonstrate that the proposed CATNet achieves superior performance over 14 state-of-the-art RGB-D methods on 7 challenging benchmarks. The code is released at https://github.com/ROC-Star/CATNet/.
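The atrous spatial pyramid idea that the AFEM builds on can be illustrated with a toy example. The sketch below is not CATNet's AFEM: it is a 1-D, pure-Python illustration with fixed weights, and the names `dilated_conv1d`, `atrous_pyramid`, and `receptive_field` are hypothetical helpers introduced only here. It shows how one kernel applied at several dilation rates covers progressively wider context without adding weights. The actual module operates on 2-D feature maps with learned weights and also uses multi-head self-attention, which this sketch omits.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float], rate: int) -> List[float]:
    """1-D dilated (atrous) cross-correlation with zero padding.

    The output has the same length as the input. Kernel taps are spaced
    `rate` samples apart, so rate=1 is an ordinary convolution layer.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * rate   # position of the j-th tap
            if 0 <= idx < len(signal):      # taps outside the signal read zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def atrous_pyramid(signal: Sequence[float], kernel: Sequence[float],
                   rates: Sequence[int] = (1, 2, 4)) -> List[List[float]]:
    """Apply the same kernel at several dilation rates, one branch per rate.

    The rates (1, 2, 4) are an illustrative assumption, not values from the paper.
    """
    return [dilated_conv1d(signal, kernel, r) for r in rates]


def receptive_field(kernel_size: int, rate: int) -> int:
    """Span, in input samples, covered by one dilated kernel."""
    return (kernel_size - 1) * rate + 1


if __name__ == "__main__":
    impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]   # single spike at index 4
    for rate, branch in zip((1, 2, 4), atrous_pyramid(impulse, [1, 1, 1])):
        print(rate, receptive_field(3, rate), branch)
```

Running the script on the impulse input shows the response of each branch spreading further from the spike as the rate grows, which is the multi-scale behaviour a pyramid of dilated convolutions is meant to provide. In a 2-D network, the branches would typically be concatenated along the channel axis and fused by a further convolution.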

HierNet: Hierarchical Transformer U-Shape Network for RGB-D Salient Object Detection

HiDAnet: RGB-D Salient Object Detection via Hierarchical Depth Awareness

Compensated Attention Feature Fusion and Hierarchical Multiplication Decoder Network for RGB-D Salient Object Detection

Depth Cue Enhancement and Guidance Network for RGB-D Salient Object Detection

Hierarchical Alternate Interaction Network for RGB-D Salient Object Detection

HODINet: High-Order Discrepant Interaction Network for RGB-D Salient Object Detection

TANet: Transformer-based Asymmetric Network for RGB-D Salient Object Detection

HDNet: Multi-Modality Hierarchy-Aware Decision Network for RGB-D Salient Object Detection

TranSal: Depth-guided Transformer for RGB-D Salient Object Detection

CATNet: A Cascaded and Aggregated Transformer Network for RGB-D Salient Object Detection

Hierarchical Dynamic Filtering Network for RGB-D Salient Object Detection

Deep Feature Filtering and Contextual Information Gathering Network for RGB-D Salient Object Detection

CFIDNet: cascaded feature interaction decoder for RGB-D salient object detection

HFMDNet: Hierarchical Fusion and Multi-Level Decoder Network for RGB-D Salient Object Detection

ETFormer: an Efficient Transformer Based on Multimodal Hybrid Fusion and Representation Learning for RGB-D-T Salient Object Detection

BTS-Net: Bi-directional Transfer-and-Selection Network For RGB-D Salient Object Detection

MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection

Transformer-based Hierarchical Dynamic Decoders for Salient Object Detection

LIANet: Layer Interactive Attention Network for RGB-D Salient Object Detection

Transformer-based Network for RGB-D Saliency Detection