UniTR: A Unified TRansformer-based Framework for Co-object and Multi-modal Saliency Detection

Ruohao Guo, Xianghua Ying, Yanyu Qi, Liao Qu
DOI: https://doi.org/10.1109/tmm.2024.3369922
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Recent years have witnessed a growing interest in co-object segmentation and multi-modal salient object detection. Many efforts have been devoted to segmenting co-existing objects among a group of images or detecting salient objects from different modalities. Despite the appreciable performance achieved on their respective benchmarks, each of these methods is limited to a specific task and cannot be generalized to other tasks. In this paper, we develop a Unified TRansformer-based framework, namely UniTR, aiming at tackling the above tasks individually with a unified architecture. Specifically, a transformer module (CoFormer) is introduced to learn the consistency of relevant objects or the complementarity of different modalities. To generate high-quality segmentation maps, we adopt a dual-stream decoding paradigm that allows the extracted consistent or complementary information to better guide mask prediction. Moreover, a feature fusion module (ZoomFormer) is designed to enhance backbone features and capture multi-granularity and multi-semantic information. Extensive experiments show that our UniTR performs well on 17 benchmarks and surpasses existing state-of-the-art approaches.
computer science, information systems, telecommunications, software engineering