Abstract: Task-oriented grasping, which involves grasping specific parts of objects based on their functions, is crucial for developing advanced robotic systems capable of performing complex tasks in dynamic environments. In this paper, we propose a training-free framework that incorporates both semantic and geometric priors for zero-shot task-oriented grasp generation. The proposed framework, SegGrasp, first leverages vision-language models such as GLIP for coarse segmentation. It then uses detailed geometric information from convex decomposition to improve segmentation quality through a fusion policy named GeoFusion. With the improved segmentation, a grasping network can then generate an effective grasp pose. We conducted experiments on both a segmentation benchmark and real-world robot grasping. The experimental results show that SegGrasp surpasses the baseline by more than 15% in grasp and segmentation performance.

### What problem does this paper attempt to solve?

This paper addresses **task-oriented grasping**: how to make a robot grasp an object by the part that suits the intended function, including in dynamic environments. Specifically, it proposes a framework named **SegGrasp**, which generates task-oriented grasp poses by combining semantic and geometric information without any training (i.e., zero-shot).

#### Main problems include:

1. **Requirements for task-oriented grasping**:
   - Robots need to grasp different functional parts of an object depending on the task. For example, a hammer should be grasped by its handle to drive nails, rather than by its head.
   - Such functional grasping is crucial for performing complex tasks, especially in dynamic environments.
2. **Limitations of existing methods**:
   - **Traditional methods**: These rely on grasping priors learned from collected data and are difficult to generalize to unseen objects or scenes.
   - **Methods based on large language models (LLMs)**: Although they generalize well, they make too little use of geometric information, which limits grasping accuracy.
   - **Insufficient use of geometric information**: Some methods fail to fully utilize geometric features, resulting in unstable grasping performance.
3. **Challenges of zero-shot learning**:
   - How to achieve task-oriented grasping of new objects or scenes without using additional training data.

### Solutions

To overcome the above problems, the SegGrasp framework adopts the following strategies:

- **Semantically guided coarse segmentation**: Use vision-language models (such as GLIP and Grounding DINO) to produce an initial coarse segmentation and extract the semantic information of objects.
- **Geometrically guided fine segmentation**: Obtain geometric information through convex decomposition and apply a fusion strategy, GeoFusion, to improve the segmentation quality.
- **Grasp pose generation**: Use the improved segmentation results to generate high-quality grasp poses through Contact-GraspNet.

Through these methods, SegGrasp achieves higher segmentation and grasping performance than existing methods and performs well in multiple benchmark tests.

### Summary

The core problem of this paper is to develop a task-oriented grasping framework that can adapt to new scenarios without additional training. By combining semantic and geometric information, SegGrasp significantly improves the accuracy and robustness of grasping in a zero-shot setting.
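
As a rough illustration of how a coarse semantic mask might be combined with convex parts, and how grasps might then be restricted to the refined region, the following Python sketch may help. It is not the paper's GeoFusion policy, whose details are not given here. It assumes a simple overlap-voting rule: a convex part is kept whole when at least a chosen fraction of its points fall inside the coarse mask. The function names (`fuse_mask_with_parts`, `filter_grasps`), the `min_overlap` threshold, and the per-point data layout are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def fuse_mask_with_parts(
    coarse_mask: Sequence[bool],
    part_ids: Sequence[int],
    min_overlap: float = 0.5,
) -> List[bool]:
    """Snap a coarse per-point mask to convex-part boundaries.

    Assumed rule (not the paper's): a part is kept whole if at least
    `min_overlap` of its points lie inside the coarse mask, else dropped.
    """
    if len(coarse_mask) != len(part_ids):
        raise ValueError("coarse_mask and part_ids must have the same length")

    total: Dict[int, int] = defaultdict(int)   # points per convex part
    inside: Dict[int, int] = defaultdict(int)  # points per part inside the mask
    for in_mask, pid in zip(coarse_mask, part_ids):
        total[pid] += 1
        if in_mask:
            inside[pid] += 1

    kept = {pid for pid, n in total.items() if inside[pid] / n >= min_overlap}
    return [pid in kept for pid in part_ids]


def filter_grasps(
    grasp_contact_idx: Sequence[int],
    refined_mask: Sequence[bool],
) -> List[int]:
    """Return positions of grasps whose contact point lies on the refined mask."""
    return [i for i, idx in enumerate(grasp_contact_idx) if refined_mask[idx]]


# Toy example: 8 points split into two convex parts (ids 0 and 1).
coarse = [True, True, False, True, False, False, False, True]
parts = [0, 0, 0, 0, 1, 1, 1, 1]
refined = fuse_mask_with_parts(coarse, parts)

# Each hypothetical grasp is represented only by the index of its contact point.
selected = filter_grasps([2, 5, 0], refined)
```

In the toy example, part 0 has three of its four points inside the coarse mask and is kept whole, while part 1 has only one of four and is dropped, so the refined mask follows the part boundary rather than the noisy coarse mask. A real pipeline would take the part labels from a convex decomposition of the object and the grasp candidates from a grasping network.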

SegGrasp: Zero-Shot Task-Oriented Grasping via Semantic and Geometric Guided Segmentation

Unseen Object Few-Shot Semantic Segmentation for Robotic Grasping

A learning framework for semantic reach-to-grasp tasks integrating machine learning and optimization.

Show and Grasp: Few-shot Semantic Segmentation for Robot Grasping through Zero-shot Foundation Models

Rethinking 6-Dof Grasp Detection: A Flexible Framework for High-Quality Grasping

SG-Grasp: Semantic Segmentation Guided Robotic Grasp Oriented to Weakly Textured Objects Based on Visual Perception Sensors

A Robotic Semantic Grasping Method for Pick-and-place Tasks

ShapeGrasp: Zero-Shot Task-Oriented Grasping with Large Language Models through Geometric Decomposition

SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images

3D Object Segmentation Using Cross-Window Point Transformer with Latent Semantic Boundary Guidance

GoalGrasp: Grasping Goals in Partially Occluded Scenarios without Grasp Training

Self-Supervised Instance Segmentation by Grasping

Learning 6-DoF Task-oriented Grasp Detection via Implicit Estimation and Visual Affordance

A Vision-based Robot Grasping System

Remote Task-oriented Grasp Area Teaching By Non-Experts through Interactive Segmentation and Few-Shot Learning

GraspGPT: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping

Simultaneous Semantic and Collision Learning for 6-DoF Grasp Pose Estimation

AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains

S4G: Amodal Single-view Single-Shot SE(3) Grasp Detection in Cluttered Scenes

Picking from Clutter: An Object Segmentation Method for Robot Grasping.

Single-View Scene Point Cloud Human Grasp Generation