RISC: Boosting High-quality Referring Image Segmentation Via Foundation Model CLIP

Zongyuan Jiang, Jiayu Chen, Chongyu Liu, Ning Zhang, Jun Huang, Xue Gao, Lianwen Jin
DOI: https://doi.org/10.1109/icme57554.2024.10687800
2024-01-01
Abstract: The foundation model CLIP has garnered significant attention in recent years for its strong capabilities across many domains of deep learning. However, the knowledge CLIP acquires from image-text pairs cannot be sufficiently transferred to dense prediction tasks such as referring image segmentation. In this paper, we propose an effective framework, termed RISC, that thoroughly exploits the potential of CLIP to boost high-quality referring image segmentation. Specifically, to transfer CLIP's knowledge to the pixel-text level, we introduce a CLIP-driven Dense Decoder that integrates CLIP features of different scales and modalities in a fine-grained manner. Furthermore, to maximize CLIP's vision-text matching capabilities, we propose a Lightweight Pixel Refiner that generates masks with distinct boundaries through point sampling and matching strategies. Extensive experiments demonstrate that our approach outperforms previous state-of-the-art methods by a notable margin on three widely used datasets (RefCOCO, RefCOCO+ and RefCOCOg).
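The abstract does not give implementation details, so the following is only a minimal sketch of what "pixel-text matching" followed by "point sampling" could look like in general. It is not the authors' code. The function names, the cosine-similarity score, the 0.5 threshold, and the "pick the points closest to the threshold" rule are all assumptions made for illustration; in practice the features would come from CLIP's image and text encoders rather than from hand-written lists.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_scores(pixel_feats, text_feat):
    """Score every pixel feature against one text feature (hypothetical helper)."""
    return [[cosine(p, text_feat) for p in row] for row in pixel_feats]

def to_mask(scores, threshold=0.5):
    """Binarize the score map; the threshold value is an assumption."""
    return [[1 if s > threshold else 0 for s in row] for row in scores]

def sample_uncertain_points(scores, k, threshold=0.5):
    """Return the k (row, col) points whose score lies closest to the threshold.

    Such points tend to sit near mask boundaries, so a refiner could
    re-score only these instead of the whole image (assumed strategy).
    """
    ranked = sorted(
        (abs(s - threshold), y, x)
        for y, row in enumerate(scores)
        for x, s in enumerate(row)
    )
    return [(y, x) for _, y, x in ranked[:k]]

# Toy 2x2 "image" with 2-dimensional features and a toy text feature.
pixel_feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
text_feat = [1.0, 0.0]

scores = match_scores(pixel_feats, text_feat)
print(to_mask(scores))                     # [[1, 0], [1, 0]]
print(sample_uncertain_points(scores, 1))  # [(1, 0)]
```

In the toy example, the pixel at row 1, column 0 has a score of about 0.71, the closest to the threshold, so it is the one a refiner would revisit first.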