Cross-Modality Global Correlational-based Visual Transformer for RGB-D Salient Object Detection

Lina Gao, Ping Fu, Lei Feng, Tiantian Wang, Bing Liu
DOI: https://doi.org/10.1109/icsmd57530.2022.10058434
2022-01-01
Abstract: Existing convolutional neural networks (CNNs) can extract contextual features within limited receptive fields and achieve promising performance on RGB-D saliency prediction tasks. However, these CNN-based models still struggle to learn global cues because of the inherently local nature of convolution. To address this issue, we propose a pure transformer network that explicitly models the correlations among cross-modality tokens and explores global-local hierarchical features. Specifically, a cross-modality fusion module (CMFM) is designed to integrate the congruity and difference information of the two modalities. We also design a transformer decoder to decode the global context and each local token. Finally, we validate our model on five challenging datasets under five evaluation metrics against ten representative models. The experimental results demonstrate that our model significantly outperforms previous state-of-the-art (SOTA) models. The ablation experiments further confirm the benefit of each component of our model.
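The abstract does not specify how the CMFM is built, so the following is only a minimal illustrative sketch of the general idea of fusing RGB and depth tokens through a cross-modality correlation, a "congruity" term, and a "difference" term. Every name here (`cross_modality_attention`, `fuse_tokens`), the use of scaled dot-product attention, the element-wise product for congruity, the absolute difference, and the residual sum are assumptions made for illustration and should not be read as the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_attention(rgb_tokens, depth_tokens):
    # Hypothetical correlation step: each RGB token attends to all depth
    # tokens through scaled dot-product attention, giving global context.
    d = rgb_tokens.shape[-1]
    scores = rgb_tokens @ depth_tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1)

def fuse_tokens(rgb_tokens, depth_tokens):
    # Align depth tokens to the RGB tokens using the correlation weights.
    attn = cross_modality_attention(rgb_tokens, depth_tokens)
    depth_aligned = attn @ depth_tokens
    # Assumed "congruity" cue: where both modalities agree (element-wise product).
    congruity = rgb_tokens * depth_aligned
    # Assumed "difference" cue: where the modalities disagree (absolute difference).
    difference = np.abs(rgb_tokens - depth_aligned)
    # Assumed residual fusion of the two cues with the RGB tokens.
    return rgb_tokens + congruity + difference

# Example: 196 tokens (a 14 x 14 patch grid) with 64 channels per modality.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((196, 64))
depth = rng.standard_normal((196, 64))
fused = fuse_tokens(rgb, depth)
print(fused.shape)  # (196, 64)
```

Because every RGB token is compared with every depth token, the correlation in this sketch is global rather than limited to a local receptive field, which is the property the abstract contrasts with CNN-based models.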