Abstract: U-shaped encoder-decoder architectures based on CNNs are deeply rooted in salient object detection (SOD). While they have driven the rapid development of saliency detection, they have also revealed two drawbacks. (1) The inherent characteristics of CNNs make it difficult to learn long-range dependencies and model global correlations. (2) To improve saliency detection performance, the encoder and decoder should complement each other and work together. However, existing encoder-decoder architectures treat the encoder and decoder independently. Specifically, the encoder extracts features, and the decoder fuses multi-level or multi-modal features to produce prediction maps. That is, the encoder alone serves the decoder, while the valuable information obtained after decoder fusion does not flow back to facilitate feature extraction. Therefore, we propose a unidirectional RGB-T salient object detection network with intertwined driving of encoding and fusion to address these problems. First, we introduce a transformer (SegFormer) as the backbone of the network to address the difficulty CNNs have in establishing long-range dependencies. Second, we construct a unidirectional architecture in which encoding and fusion are intertwined and mutually driving, which avoids the drawbacks of the encoder-decoder architecture and makes the network more powerful and concise. Based on the unidirectional architecture, the proposed Local Detail-driven Fusion Module (LDFM) uses the fused features of the previous level to drive the cross-modal fusion at the current level. Meanwhile, the proposed Local Detail-driven Weighting Module (LDWM) uses the fused features to drive the cross-modal weighting. Together, they drive more effective features into the next-level encoding block. Comprehensive experiments verify the superior performance of our method on the RGB-T saliency detection task.
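
The data flow described in the abstract can be illustrated with a toy, pure-Python sketch. This is not the authors' implementation: the functions `encode_block`, `fuse`, and `weight` are hypothetical stand-ins for a transformer encoding stage, LDFM, and LDWM respectively, the arithmetic inside them is arbitrary, and the choice of four levels is an assumption. Only the control flow follows the abstract: fused features from the previous level drive the current fusion, and the fused features re-weight both modalities before the next encoding block.

```python
from math import exp
from typing import List, Optional, Tuple

Feature = List[float]  # toy stand-in for a feature map


def encode_block(x: Feature, level: int) -> Feature:
    """Hypothetical stand-in for one encoding stage (here: simple scaling)."""
    return [v * (level + 1) for v in x]


def fuse(rgb: Feature, thermal: Feature, prev_fused: Optional[Feature]) -> Feature:
    """Hypothetical stand-in for LDFM: the previous level's fused features
    drive the cross-modal fusion at the current level."""
    base = [(r + t) / 2 for r, t in zip(rgb, thermal)]
    if prev_fused is None:  # first level has no previous fusion
        return base
    return [b + p for b, p in zip(base, prev_fused)]


def weight(rgb: Feature, thermal: Feature, fused: Feature) -> Tuple[Feature, Feature]:
    """Hypothetical stand-in for LDWM: fused features gate both modalities."""
    gate = [1.0 / (1.0 + exp(-f)) for f in fused]  # sigmoid gate (assumption)
    return ([r * g for r, g in zip(rgb, gate)],
            [t * g for t, g in zip(thermal, gate)])


def unidirectional_forward(rgb: Feature, thermal: Feature,
                           num_levels: int = 4) -> List[Feature]:
    """Encoding and fusion are interleaved level by level; there is no
    separate decoder pass in this sketch."""
    fused: Optional[Feature] = None
    fused_per_level: List[Feature] = []
    for level in range(num_levels):
        rgb = encode_block(rgb, level)
        thermal = encode_block(thermal, level)
        fused = fuse(rgb, thermal, fused)           # driven by previous fusion
        rgb, thermal = weight(rgb, thermal, fused)  # fed to the next level
        fused_per_level.append(fused)
    return fused_per_level
```

Calling `unidirectional_forward([1.0, 2.0], [3.0, 4.0])` returns one fused feature per level. In a real network, the stand-in functions would be learned modules operating on multi-channel feature maps.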

Transformer Fusion and Pixel-Level Contrastive Learning for RGB-D Salient Object Detection

Cross-Modal Fusion and Progressive Decoding Network for RGB-D Salient Object Detection

EM-Trans: Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection

Transformers and CNNs Fusion Network for Salient Object Detection

Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond

CFIDNet: cascaded feature interaction decoder for RGB-D salient object detection

CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse

ETFormer: an Efficient Transformer Based on Multimodal Hybrid Fusion and Representation Learning for RGB-D-T Salient Object Detection

Learning Selective Mutual Attention and Contrast for RGB-D Saliency Detection

Modality-Induced Transfer-Fusion Network for RGB-D and RGB-T Salient Object Detection

Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection

Unidirectional RGB-T salient object detection with intertwined driving of encoding and fusion

Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection

Transformer-based Network for RGB-D Saliency Detection

A Unified Structure for Efficient RGB and RGB-D Salient Object Detection

RGBD Salient Object Detection via Disentangled Cross-modal Fusion

Compensated Attention Feature Fusion and Hierarchical Multiplication Decoder Network for RGB-D Salient Object Detection

TranSal: Depth-guided Transformer for RGB-D Salient Object Detection

Cross-Modality Double Bidirectional Interaction and Fusion Network for RGB-T Salient Object Detection

Adaptive Fusion for RGB-D Salient Object Detection