Abstract: In human–robot interaction, integrating visual and verbal cues has become increasingly important. This paper addresses referring image segmentation (RIS), the task of segmenting the image region described by a textual expression. Traditional RIS approaches primarily rely on pixel-level classification. Although effective, these methods often overlook the relationships between pixels, which can be crucial for interpreting complex visual scenes. Furthermore, while the PolyFormer model has shown impressive performance in RIS, its large parameter count and high training data requirements make it difficult to adapt and optimize on standard consumer hardware, which hinders follow-up research. To address these issues, we introduce a novel two-branch decoder framework combined with SAM (Segment Anything Model) for RIS. The framework incorporates an MLP decoder and a KAN decoder with a multi-scale feature fusion module, enhancing the model's capacity to discern fine details within images. An ensemble learning strategy that consolidates the outputs of the MLP and KAN decoder branches further improves robustness. More importantly, we collect the edge coordinates and bounding box coordinates of the segmentation target and use them as input prompts for SAM. This strategy leverages SAM's zero-shot capabilities to refine the segmentation results. Experiments on the widely used RefCOCO, RefCOCO+, and RefCOCOg datasets confirm the effectiveness of the method: it achieves state-of-the-art (SOTA) segmentation performance, and ablation studies highlight the contribution of each component to the overall improvement.
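As a rough illustration of the prompt-collection step described in the abstract, the following minimal sketch fuses two decoder outputs and derives a bounding box and a few edge points from the resulting coarse mask. It is not the authors' implementation: the function names (`ensemble_masks`, `mask_to_box`, `mask_edge_points`), the simple probability averaging, the 0.5 threshold, and the uniform sampling of boundary pixels are all assumptions made for this example.

```python
import numpy as np


def ensemble_masks(prob_mlp, prob_kan, threshold=0.5):
    """Average two per-pixel foreground probability maps, then threshold.

    Plain averaging is an assumed fusion rule, chosen only for illustration.
    """
    fused = (prob_mlp + prob_kan) / 2.0
    return fused >= threshold


def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the foreground, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def mask_edge_points(mask, max_points=8):
    """Sample up to max_points (x, y) coordinates lying on the mask boundary."""
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior if all four 4-connected neighbours are foreground.
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1]
        & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    edge = mask & ~interior
    ys, xs = np.nonzero(edge)
    if ys.size == 0:
        return np.empty((0, 2), dtype=int)
    # Pick evenly spaced indices from the boundary pixels (row-major order).
    idx = np.linspace(0, ys.size - 1, num=min(max_points, ys.size)).astype(int)
    return np.stack([xs[idx], ys[idx]], axis=1)


# Toy example: two hypothetical decoder outputs on a 6x6 image.
prob_mlp = np.zeros((6, 6))
prob_kan = np.zeros((6, 6))
prob_mlp[1:5, 1:5] = 0.9
prob_kan[1:5, 1:5] = 0.7
prob_kan[0, 0] = 0.6  # the branches disagree here; averaging suppresses it

coarse_mask = ensemble_masks(prob_mlp, prob_kan)
box = mask_to_box(coarse_mask)
points = mask_edge_points(coarse_mask, max_points=8)
```

The resulting `box` and `points` are the kind of geometric prompts a promptable segmenter accepts; for instance, the `segment_anything` package's `SamPredictor.predict` takes `point_coords`, `point_labels`, and `box` arguments. How the paper actually selects edge points and combines the two branches is not specified in the abstract.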

Unambiguous Scene Text Segmentation with Referring Expression Comprehension

Unambiguous Text Localization, Retrieval, and Recognition for Cluttered Scenes

Text Augmented Spatial-aware Zero-shot Referring Image Segmentation

Text-Vision Relationship Alignment for Referring Image Segmentation

Scene Text Detection via Holistic, Multi-Channel Prediction

Towards Accurate Scene Text Recognition with Semantic Reasoning Networks

Referring Image Segmentation via Cross-Modal Progressive Comprehension

Improving Referring Image Segmentation using Vision-Aware Text Features

Beyond One-to-One: Rethinking the Referring Image Segmentation

Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions

Visual and semantic guided scene text retrieval

Scene text recognition in mobile applications by character descriptor and structure configuration

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

Rethinking Referring Object Removal

Locate then Segment: A Strong Pipeline for Referring Image Segmentation

Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping

A Hybrid Framework for Referring Image Segmentation: Dual-Decoder Model with SAM Complementation

CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning

Multi-oriented Scene Text Detection via Corner Localization and Region Segmentation

SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition

Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension