Adaptive Selection Based Referring Image Segmentation

Pengfei Yue, Jianghang Lin, Shengchuan Zhang, Jie Hu, Yilin Lu, Hongwei Niu, Haixin Ding, Yan Zhang, Guannan Jiang, Liujuan Cao, Rongrong Ji
DOI: https://doi.org/10.1145/3664647.3680850
2024-01-01
Abstract: Referring image segmentation (RIS) aims to segment a particular region based on a specific expression. Existing one-stage methods have explored various fusion strategies, yet they encounter two significant issues. First, most methods rely on manually selected visual features from the visual encoder layers. Second, the direct fusion of word-level features into coarsely aligned features disrupts the established vision-language alignment. In this paper, we introduce a framework for RIS that seeks to overcome these challenges with adaptive alignment of vision and language features, termed Adaptive Selection with Dual Alignment (ASDA). ASDA innovates in two aspects. Firstly, we design an Adaptive Feature Selection and Fusion (AFSF) module to dynamically select visual features focusing on different regions related to various descriptions. AFSF is equipped with a scale-wise feature aggregator to provide hierarchically coarse features that preserve crucial low-level details. Secondly, a Word Guided Dual-Branch Aligner (WGDA) is leveraged to integrate coarse features with linguistic cues by word-guided attention, which effectively addresses the common issue of vision-language misalignment. Extensive experimental results demonstrate that our ASDA framework surpasses state-of-the-art methods on the RefCOCO, RefCOCO+ and G-Ref benchmarks.
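To make the contrast between manually selected and dynamically selected encoder features concrete, the following is a minimal sketch of one possible selection scheme. It is not the ASDA implementation: the abstract does not specify how AFSF scores or fuses features, so the function `select_and_fuse`, the use of one pooled vector per encoder layer, and the scaled dot-product scoring against a sentence embedding are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_and_fuse(layer_features, sentence_embedding):
    """Hypothetical adaptive selection (an assumption, not the paper's AFSF).

    Each encoder layer contributes one pooled feature vector. Layers are
    weighted by their scaled dot-product similarity to the sentence
    embedding, so the expression decides which layers dominate the fusion
    instead of a fixed, hand-picked subset of layers.
    """
    dim = len(sentence_embedding)
    scores = [dot(f, sentence_embedding) / math.sqrt(dim) for f in layer_features]
    weights = softmax(scores)
    fused = [
        sum(w * f[i] for w, f in zip(weights, layer_features))
        for i in range(dim)
    ]
    return weights, fused


# Toy data: three "layers" with 2-dimensional pooled features.
layers = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]  # stand-in for a sentence embedding
weights, fused = select_and_fuse(layers, query)
```

In this toy example the first layer is most similar to the query and therefore receives the largest weight; a different expression would shift the weights toward other layers without changing the code.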