Abstract: Previous multi-modal transformers for RGB-D salient object detection (SOD) typically connect all patches from the two modalities to model cross-modal correlation and combine the modalities without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce cross-modal fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities, as done before, is uninformative due to the severe modality gap. We instead disentangle the cross-modal complementary contexts into intra-modal self-attention, which explores global complementary understanding, and spatially aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike the previous undifferentiated combination of cross-modal representations, we find that cross-modal cues complement each other in two ways: by enhancing common discriminative regions and by mutually supplementing modality-specific highlights. Building on this, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration path and explicitly boost these two complementary ways. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid module (DFP) enables informative cross-modal, cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and show consistent improvement over state-of-the-art models. Additionally, our cross-modal attention hierarchy can be plugged into different backbone architectures (both transformer and CNN) and downstream tasks; experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability.
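The two disentanglement ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: tokens are laid out in 1-D, attention uses the raw tokens as queries, keys, and values (no learned projections), "spatially aligned" is modeled as a window of hypothetical size `radius` around the same position, and consistent channels are fused by averaging while private channels are concatenated. All function names are illustrative.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys_values):
    # Scaled dot-product attention of one query over a list of tokens.
    # Assumption: tokens act as both keys and values (no learned projections).
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(q * k for q, k in zip(query, kv)) / scale for kv in keys_values]
    )
    return [
        sum(w * kv[c] for w, kv in zip(weights, keys_values))
        for c in range(len(query))
    ]

def intra_modal_attention(tokens):
    # Global context: each token attends to every token of its OWN modality.
    return [attend(q, tokens) for q in tokens]

def aligned_inter_modal_attention(queries, others, radius=1):
    # Local cross-modal context: the token at position i attends only to
    # tokens of the other modality within `radius` of position i.
    n = len(queries)
    return [
        attend(q, others[max(0, i - radius):min(n, i + radius + 1)])
        for i, q in enumerate(queries)
    ]

def fuse_disentangled(rgb_token, depth_token, n_consistent):
    # Assumption: the first n_consistent channels are "consistent" (shared)
    # and are averaged; the remaining "private" channels are concatenated.
    shared = [
        (r + d) / 2
        for r, d in zip(rgb_token[:n_consistent], depth_token[:n_consistent])
    ]
    return shared + rgb_token[n_consistent:] + depth_token[n_consistent:]

# Toy data: 3 spatial positions, 4 channels per token.
rgb = [[1.0, 0.0, 2.0, 0.5], [0.0, 1.0, 0.5, 2.0], [1.0, 1.0, 0.0, 0.0]]
depth = [[0.5, 0.5, 1.0, 1.0], [2.0, 0.0, 0.0, 1.0], [0.0, 2.0, 1.0, 0.0]]

rgb_global = intra_modal_attention(rgb)                   # global, same modality
rgb_local = aligned_inter_modal_attention(rgb, depth, 1)  # local, cross-modal
fused = [fuse_disentangled(r, d, n_consistent=2) for r, d in zip(rgb, depth)]
```

In this sketch, the cost of the inter-modal step grows with the window size rather than with the full token count, which is one plausible reading of why restricting cross-modal attention to aligned positions could be more efficient than connecting all patches from both modalities.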

Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond

RGBD Salient Object Detection via Disentangled Cross-modal Fusion

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

EM-Trans: Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection

Modality-Induced Transfer-Fusion Network for RGB-D and RGB-T Salient Object Detection

Transformers and CNNs Fusion Network for Salient Object Detection

Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection

ETFormer: an Efficient Transformer Based on Multimodal Hybrid Fusion and Representation Learning for RGB-D-T Salient Object Detection

CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection

Cross-Modality Fusion Transformer for Multispectral Object Detection

Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection

Discriminative Cross-Modal Transfer Learning and Densely Cross-Level Feedback Fusion for RGB-D Salient Object Detection

CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse

Transformer-based Network for RGB-D Saliency Detection

Cross-Modality Double Bidirectional Interaction and Fusion Network for RGB-T Salient Object Detection

Cross-Modal Fusion and Progressive Decoding Network for RGB-D Salient Object Detection

Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images

Enabling modality interactions for RGB-T salient object detection

TranSal: Depth-guided Transformer for RGB-D Salient Object Detection

DFTR: Depth-supervised Fusion Transformer for Salient Object Detection