Bimodal Information Fusion Network for Salient Object Detection based on Transformer

Zhuo Wang, Minyue Xiao, Jing He, Chao Zhang, Keren Fu
DOI: https://doi.org/10.1109/PRML56267.2022.9882262
2022-01-01
Abstract: Inspired by the strong performance of the Transformer on computer vision tasks in recent years, this paper designs a Transformer-based bimodal information fusion network (BIFNet) for salient object detection (SOD). Unlike most existing models, which are CNN-based or hybrid CNN+Transformer designs, BIFNet uses a pure Transformer framework for both the feature extraction and the fusion phases. It treats bimodal features as sequence-to-sequence context information in order to learn cross-modality, context-aware feature representations. Specifically, we first use a weight-sharing Swin Transformer as a Siamese backbone to extract hierarchical sequence features from both modalities. Next, we design two modules, a dual self-and-mutual attention (DSMA) module and a multi-scale fusion (MSF) module, to further explore saliency cues and to fuse the information of the two modalities fully and effectively. We apply BIFNet to two bimodal detection tasks, RGB-D SOD and RGB-T SOD, and comprehensive experimental results on public benchmark datasets demonstrate the superiority of the proposed method.
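The abstract names self-attention and mutual (cross-modality) attention but does not specify the layout of the DSMA module. The following NumPy sketch illustrates only the general idea under stated assumptions: single-head scaled dot-product attention, projection weights shared across both modalities, and fusion by plain summation. The function names, the token shapes, and the random inputs are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention.

    Queries come from `query_tokens`; keys and values come from
    `context_tokens`. Passing the same array twice gives self-attention.
    """
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def self_and_mutual_fusion(rgb_tokens, aux_tokens, w_q, w_k, w_v):
    """Hypothetical fusion step, NOT the paper's DSMA module.

    Assumption: both modalities share the same projection weights and the
    four attention outputs are fused by summation.
    """
    # Self-attention: each modality attends to its own tokens.
    rgb_self = attention(rgb_tokens, rgb_tokens, w_q, w_k, w_v)
    aux_self = attention(aux_tokens, aux_tokens, w_q, w_k, w_v)
    # Mutual attention: queries from one modality, keys/values from the other.
    rgb_mutual = attention(rgb_tokens, aux_tokens, w_q, w_k, w_v)
    aux_mutual = attention(aux_tokens, rgb_tokens, w_q, w_k, w_v)
    return rgb_self + rgb_mutual + aux_self + aux_mutual


# Toy inputs standing in for backbone token sequences (assumed shapes).
rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
rgb_tokens = rng.standard_normal((n_tokens, dim))  # RGB stream
aux_tokens = rng.standard_normal((n_tokens, dim))  # depth or thermal stream

fused = self_and_mutual_fusion(rgb_tokens, aux_tokens, w_q, w_k, w_v)
print(fused.shape)
```

In this sketch the fused output keeps the token layout of the inputs, so it could be passed on to a later fusion or decoding stage. A real implementation would more likely use multi-head attention with learned weights inside a deep-learning framework.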