Abstract: RGB-D salient object detection (SOD) has attracted considerable attention in recent years. In particular, transformers have been employed and have shown great potential. However, existing transformer models usually overlook vital edge information, which is a major obstacle to further improving SOD accuracy. To this end, we propose a novel edge-aware RGB-D SOD transformer, called EM-Trans, which explicitly models edge information in a dual-band decomposition framework. Specifically, we employ two parallel decoder networks to learn the high-frequency edge features and the low-frequency body features from the low- and high-level features, respectively, extracted by a two-stream multimodal backbone network. Next, we propose a cross-attention complementarity exploration module that enriches the edge/body features by exploiting complementary multimodal information. The refined features are then fed into our proposed color-hint guided fusion module, which enhances the depth feature and fuses the multimodal features. Finally, the resulting features are fused by our deeply supervised progressive fusion module, which progressively integrates the edge and body features to predict saliency maps. By explicitly considering edge information, our model overcomes the limitations of existing methods and improves RGB-D SOD accuracy. Extensive experiments on benchmark datasets demonstrate that EM-Trans is an effective RGB-D SOD framework that outperforms current state-of-the-art models, both quantitatively and qualitatively. A further extension to RGB-T SOD demonstrates the potential of our model for other multimodal SOD tasks.
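As a rough illustration of how cross-attention can let one modality draw on complementary information from the other, the following is a minimal NumPy sketch, not the authors' module. The function name `cross_attention`, the token counts, the feature dimension, the random projection weights, and the residual connection are all assumptions made for this example; the abstract does not specify any of them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention: queries come from one
    modality, keys and values from the other."""
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    # Scaled dot-product scores between every query and every context token.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Residual connection (an assumption): keep the original features
    # and add what was gathered from the other modality.
    return query_tokens + weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8  # assumed sizes, for illustration only
rgb = rng.standard_normal((n_tokens, dim))
depth = rng.standard_normal((n_tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))

# RGB tokens attend to depth tokens, and vice versa.
rgb_enriched, attn = cross_attention(rgb, depth, w_q, w_k, w_v)
depth_enriched, _ = cross_attention(depth, rgb, w_q, w_k, w_v)
print(rgb_enriched.shape, depth_enriched.shape)
```

In this sketch each row of `attn` is a probability distribution over the depth tokens, so every enriched RGB token is its original feature plus a weighted mixture of projected depth features.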

ABC-Trans: a novel adaptive border-augmented cross-attention transformer for object detection

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

An Extendable, Efficient and Effective Transformer-based Object Detector

DA-DETR: Domain Adaptive Detection Transformer with Information Fusion

Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment

EM-Trans: Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection

Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection

Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer

Deformable DETR: Deformable Transformers for End-to-End Object Detection

End-to-End Object Detection with Adaptive Clustering Transformer

ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Anchor DETR: Query Design for Transformer-Based Detector

Cross-Modality Fusion Transformer for Multispectral Object Detection

Efficient Decoder-Free Object Detection with Transformers

GroupTransNet: Group transformer network for RGB-D salient object detection

DETR++: Taming Your Multi-Scale Detection Transformer

CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector

Cross-Modality Global Correlational-based Visual Transformer for RGB-D Salient Object Detection

CNN-transformer mixed model for object detection

MCANet: Hierarchical cross-fusion lightweight transformer based on multi-ConvHead attention for object detection