MARIS: Referring Image Segmentation Via Mutual-Aware Attention Features

Mengxi Zhang, Yiming Liu, Xiangjun Yin, Huanjing Yue, Jingyu Yang
DOI: https://doi.org/10.48550/arxiv.2311.15727
2023-01-01
Abstract: Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt. Existing methods incorporate linguistic features into visual features and obtain multi-modal features for mask decoding. However, these methods may segment the visually salient entity instead of the correct referring region, as the multi-modal features are dominated by the abundant visual context. In this paper, we propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism to enhance the cross-modal fusion via two parallel branches. Specifically, our mutual-aware attention mechanism consists of Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features. Correspondingly, we design a Mask Decoder to enable explicit linguistic guidance for more consistent segmentation with the language expression. To this end, a multi-modal query token is proposed to integrate linguistic information and interact with visual information simultaneously. Extensive experiments on three benchmark datasets show that our method outperforms the state-of-the-art RIS methods. Our code will be publicly available.
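The abstract describes two parallel attention branches that model the visual-linguistic relationship in both directions. The following is a minimal sketch of that general idea using plain scaled dot-product cross-attention in NumPy. It is not the authors' implementation: the function names (cross_attention, mutual_aware_fusion), the residual addition, the absence of learned projections, and the feature shapes are all assumptions made for illustration, and the sketch does not claim which direction corresponds to the paper's Vision-Guided or Language-Guided Attention.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over context rows.

    queries: (n_q, d) array, context: (n_c, d) array.
    Returns the attended features (n_q, d) and the weights (n_q, n_c).
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context, weights


def mutual_aware_fusion(visual, linguistic):
    """Hypothetical two-branch fusion (an assumption, not the paper's design).

    visual: (n_v, d) visual tokens, linguistic: (n_l, d) word tokens.
    """
    # Branch 1: every visual token gathers the word features relevant to it.
    lang_for_visual, _ = cross_attention(visual, linguistic)
    # Branch 2: every word token gathers the visual features relevant to it.
    visual_for_lang, _ = cross_attention(linguistic, visual)
    # Residual addition keeps the original features alongside the fused ones.
    return visual + lang_for_visual, linguistic + visual_for_lang
```

Both branches read the same unfused inputs, so they can run in parallel and neither modality's update depends on the other's output.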