Abstract: In human–robot interaction, the integration of visual and verbal cues has become increasingly important. This paper addresses referring image segmentation (RIS), the task of segmenting the image region described by a textual expression. Traditional approaches to RIS have primarily treated the task as pixel-level classification. These methods, although effective, often overlook the relationships between pixels, which can be crucial for interpreting complex visual scenes. Furthermore, while the PolyFormer model has shown impressive performance in RIS, its large number of parameters and high training data requirements make it difficult to adapt and optimize on standard consumer hardware, which hinders further improvement in subsequent research. To address these issues, we introduce a novel two-branch decoder framework complemented by the Segment Anything Model (SAM) for RIS. The framework incorporates an MLP decoder and a KAN decoder with a multi-scale feature fusion module, enhancing the model's capacity to discern fine details within images. Its robustness is further strengthened by an ensemble learning strategy that combines the predictions of the MLP and KAN decoder branches. More importantly, we extract the edge coordinates and bounding box coordinates of the segmentation target and use them as input prompts for SAM. This strategy leverages SAM's zero-shot capabilities to refine the segmentation results. Experiments on the widely used RefCOCO, RefCOCO+, and RefCOCOg datasets confirm the effectiveness of this method. The method achieves state-of-the-art (SOTA) segmentation performance, and ablation studies highlight the contribution of each component to the overall improvement.
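The prompt-construction step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two decoder outputs are combined or how the edge coordinates are extracted, so everything here is an assumption: the function names `fuse_masks`, `bounding_box`, and `edge_points` are hypothetical, the ensemble is taken to be a plain average of per-pixel probabilities, and an edge pixel is taken to be any foreground pixel with a 4-connected background neighbour. The call to SAM itself is left out.

```python
# Illustrative sketch only; not the authors' implementation.
# Assumptions: averaging ensemble, 4-neighbour boundary definition.

def fuse_masks(prob_a, prob_b, threshold=0.5):
    """Average two per-pixel probability maps and threshold the result."""
    return [
        [1 if (a + b) / 2.0 >= threshold else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(prob_a, prob_b)
    ]


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the foreground, or None if empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def edge_points(mask):
    """Return (x, y) foreground pixels that touch background or the image border."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    points = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                outside = not (0 <= nx < width and 0 <= ny < height)
                if outside or not mask[ny][nx]:
                    points.append((x, y))
                    break
    return points


def make_map(inside, outside, size=5):
    """Build a toy probability map with a 3x3 target in the centre."""
    return [
        [inside if 1 <= x <= 3 and 1 <= y <= 3 else outside for x in range(size)]
        for y in range(size)
    ]


# Toy outputs standing in for the MLP and KAN decoder branches.
prob_mlp = make_map(0.9, 0.1)
prob_kan = make_map(0.7, 0.2)
prob_kan[4][4] = 0.8  # the branches disagree on one background pixel

fused = fuse_masks(prob_mlp, prob_kan)
box = bounding_box(fused)
edges = edge_points(fused)
```

In a full pipeline, `box` and a subset of `edges` would be handed to SAM as box and point prompts. How many edge points to keep, and how they are sampled, is a design choice the abstract leaves open.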

SegICP: Integrated Deep Semantic Segmentation and Pose Estimation

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

Semantic Segmentation and 6DoF Pose Estimation using RGB-D Images and Deep Neural Networks

ProcNet: Deep Predictive Coding Model for Robust-to-occlusion Visual Segmentation and Pose Estimation

An Onboard Point Cloud Semantic Segmentation System for Robotic Platforms

RISeg: Robot Interactive Object Segmentation via Body Frame-Invariant Features

KISS-ICP: In Defense of Point-to-Point ICP -- Simple, Accurate, and Robust Registration If Done the Right Way

Deep instance segmentation and 6D object pose estimation in cluttered scenes for robotic autonomous grasping

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Sparse Convolution Based 6D Pose Estimation for Robotic Bin-Picking with Point Clouds

Bridging the Robot Perception Gap with Mid-Level Vision

Real-Time Joint Semantic Segmentation and Depth Estimation Using Asymmetric Annotations

Simultaneous Semantic and Collision Learning for 6-DoF Grasp Pose Estimation

A Hybrid Framework for Referring Image Segmentation: Dual-Decoder Model with SAM Complementation

Real-time 3D Semantic Scene Perception for Egocentric Robots with Binocular Vision

6D Pose Estimation of Industrial Parts Based on Point Cloud Geometric Information Prediction for Robotic Grasping

A Method for Unseen Object Six Degrees of Freedom Pose Estimation Based on Segment Anything Model and Hybrid Distance Optimization

RGB-D-Based Pose Estimation of Workpieces with Semantic Segmentation and Point Cloud Registration

Bimodal SegNet: Instance Segmentation Fusing Events and RGB Frames for Robotic Grasping

A Segmentation-Driven Approach for 6D Object Pose Estimation in the Crowd