Abstract: In human–robot interaction, integrating visual and verbal cues has become increasingly important. This paper addresses referring image segmentation (RIS), the task of segmenting the image region described by a natural-language expression. Traditional approaches to RIS treat the task primarily as pixel-level classification. Although effective, these methods often overlook the relationships between pixels, which can be crucial for interpreting complex visual scenes. Furthermore, while the PolyFormer model has shown impressive performance in RIS, its large number of parameters and high training data requirements make it difficult to adapt and optimize on standard consumer hardware, which hinders follow-up research. To address these issues, our study introduces a novel two-branch decoder framework for RIS that is complemented by the Segment Anything Model (SAM). The framework combines an MLP decoder and a KAN decoder with a multi-scale feature fusion module, enhancing the model's capacity to discern fine details within images. Its robustness is further improved by an ensemble learning strategy that consolidates the predictions of the MLP and KAN decoder branches. More importantly, we extract the edge coordinates and bounding-box coordinates of the segmentation target and supply them as input prompts to SAM. This strategy leverages SAM's zero-shot capabilities to refine and optimize the segmentation outcomes. Experiments on the widely used RefCOCO, RefCOCO+, and RefCOCOg datasets confirm the effectiveness of this method. The results achieve state-of-the-art (SOTA) segmentation performance, and ablation studies highlight the contribution of each component to the overall improvement.
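The prompt-construction step described in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: the two branch outputs are fused by simple averaging, the fused map is thresholded at 0.5, and an edge pixel is a foreground pixel with at least one background or out-of-image 4-neighbour. All function names (`ensemble_mean`, `binarize`, `bounding_box`, `edge_points`) are hypothetical, and the toy probability maps are invented for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]
Mask = List[List[int]]


def ensemble_mean(prob_a: Grid, prob_b: Grid) -> Grid:
    """Average two per-pixel probability maps (assumed fusion rule)."""
    return [[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(prob_a, prob_b)]


def binarize(prob: Grid, threshold: float = 0.5) -> Mask:
    """Turn a probability map into a 0/1 mask (threshold is an assumption)."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob]


def bounding_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) of the foreground pixels."""
    coords = [(x, y) for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def edge_points(mask: Mask) -> List[Tuple[int, int]]:
    """Foreground pixels that touch background or the image border."""
    h, w = len(mask), len(mask[0])
    points = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            if any(not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]
                   for ny, nx in neighbours):
                points.append((x, y))
    return points


# Toy 5x5 outputs standing in for the MLP and KAN decoder branches.
mlp_prob = [[0.1] * 5,
            [0.1, 0.9, 0.9, 0.9, 0.1],
            [0.1, 0.9, 0.9, 0.9, 0.1],
            [0.1, 0.9, 0.9, 0.9, 0.1],
            [0.1] * 5]
kan_prob = [row[:] for row in mlp_prob]
kan_prob[1][1] = 0.3  # branches disagree; the mean (0.6) stays foreground
kan_prob[0][2] = 0.7  # branches disagree; the mean (0.4) stays background

mask = binarize(ensemble_mean(mlp_prob, kan_prob))
box = bounding_box(mask)    # candidate box prompt
points = edge_points(mask)  # candidate point prompts along the contour
```

In a full pipeline, `box` and `points` would be passed to a promptable segmenter such as SAM as box and point prompts; that call is omitted here so the sketch has no external dependencies.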

Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation

Cross-Modal Fusing Vision-Language Network for Referring Image Segmentation

DCMFNet: Deep Cross-Modal Fusion Network for Referring Image Segmentation with Iterative Gated Fusion

Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation

CMF: Cascaded Multi-Model Fusion for Referring Image Segmentation

Structured Multimodal Fusion Network for Referring Image Segmentation

Multiscale Deep Feature Selection Fusion Network for Referring Image Segmentation

Referring Image Segmentation with Fine-Grained Semantic Funneling Infusion

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

A Hybrid Framework for Referring Image Segmentation: Dual-Decoder Model with SAM Complementation

DCMFNet: Deep Cross-Modal Fusion Network for Different Modalities with Iterative Gated Fusion

Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation

Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation

Fully Aligned Network for Referring Image Segmentation

Adaptive Selection Based Referring Image Segmentation

Global Context Enhanced Multi-modal Fusion for Referring Image Segmentation

Referring Segmentation Via Encoder-Fused Cross-Modal Attention Network

Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

Multimodal Fusion Refiner Networks

CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation

RISC: Boosting High-quality Referring Image Segmentation Via Foundation Model CLIP