ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

Yaoyao Qian, Xupeng Zhu, Ondrej Biza, Shuo Jiang, Linfeng Zhao, Haojie Huang, Yu Qi, Robert Platt

2024-07-16

Abstract: Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that makes use of GPT-4o's advanced contextual reasoning for heavy clutter environment grasping strategies. ThinkGrasp can effectively identify and generate grasp poses for target objects, even when they are heavily obstructed or nearly invisible, by using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it with a few steps and a high success rate. In both simulated and real experiments, ThinkGrasp achieved a high success rate and significantly outperformed state-of-the-art methods in heavily cluttered environments or with diverse unseen objects, demonstrating strong generalization capabilities.

Robotics

What problem does this paper attempt to address?

The paper addresses the difficulty robots face when grasping target objects in cluttered environments. Existing methods struggle to identify and grasp objects that are heavily occluded or completely hidden by other items. To address this, the authors propose ThinkGrasp, a plug-and-play vision-language grasping system that draws on large-scale pre-trained vision-language models such as GPT-4o. ThinkGrasp relies on three elements:

1. **Advanced reasoning capabilities**: It uses the contextual reasoning abilities of GPT-4o to understand and segment objects in the environment.
2. **Goal-oriented language guidance**: It uses natural language instructions to guide the removal of occluding objects, gradually exposing the target object until it can be grasped with a high success rate.
3. **Modular design**: The system is modular, so it can be integrated into various robotic platforms and adapted quickly to new language goals and new objects.

Experimental results show that ThinkGrasp significantly outperforms existing methods in both simulated and real environments in terms of success rate and efficiency. It generalizes especially well to heavily cluttered environments and to scenarios with unseen objects.
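The progressive uncovering strategy described above can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: the scene is a hand-written occlusion graph, and the hypothetical function `choose_next` stands in for the vision-language reasoning step with a simple rule (pick an unobstructed object that lies on top of the target). All names here are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """Toy scene entry; `occluded_by` names the objects lying on top of it."""
    name: str
    occluded_by: set = field(default_factory=set)


def choose_next(scene, target):
    """Hypothetical stand-in for the language-model reasoning step.

    Follows the occlusion chain upward from the target and returns the
    first object that nothing else in the scene is covering.
    """
    current = target
    visited = {current}  # guards against cyclic occlusion data
    while True:
        blockers = sorted(
            b for b in scene[current].occluded_by
            if b in scene and b not in visited
        )
        if not blockers:
            return current
        current = blockers[0]
        visited.add(current)


def grasp_target(scene, target, max_steps=10):
    """Remove occluders one at a time until the target itself is grasped.

    Returns the ordered list of grasped object names. Removing an entry
    from `scene` simulates a successful grasp (an assumption of this toy).
    """
    grasped = []
    for _ in range(max_steps):
        pick = choose_next(scene, target)
        del scene[pick]
        grasped.append(pick)
        if pick == target:
            return grasped
    raise RuntimeError("target not reached within max_steps")


if __name__ == "__main__":
    scene = {
        "mango": SceneObject("mango", {"box", "cup"}),
        "box": SceneObject("box", {"towel"}),
        "cup": SceneObject("cup"),
        "towel": SceneObject("towel"),
    }
    print(grasp_target(scene, "mango"))
```

In a real system the rule in `choose_next` would be replaced by a query to the vision-language model over camera images, and each removal would be a physical grasp that may fail, so the loop would need to re-observe the scene after every step.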

VL-Grasp: a 6-Dof Interactive Grasp Policy for Language-Oriented Objects in Cluttered Indoor Scenes

A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter

Decision-Making in Robotic Grasping with Large Language Models.

Language-Guided Category Push–Grasp Synergy Learning in Clutter by Efficiently Perceiving Object Manipulation Space

SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images

Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach

A Vision-based Robot Grasping System

Grasp Region Exploration for 7-Dof Robotic Grasping in Cluttered Scenes

GraspGPT: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping

PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models

UPG: 3D Vision-Based Prediction Framework for Robotic Grasping in Multi-Object Scenes.

GE-Grasp: Efficient Target-Oriented Grasping in Dense Clutter

Lan-grasp: Using Large Language Models for Semantic Object Grasping

Efficient and Robust Robotic Grasping in Cluttered Scenes: A Point Cloud-Based Approach with Heuristic Evaluation.

Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice

INVIGORATE: Interactive Visual Grounding and Grasping in Clutter

AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains

MVGrasp: Real-time multi-view 3D object grasping in highly cluttered environments

Robotic Grasping in Multi-Object Stacking Scenes Based on Visual Reasoning

FoundationGrasp: Generalizable Task-Oriented Grasping with Foundation Models