ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

Yaoyao Qian, Xupeng Zhu, Ondrej Biza, Shuo Jiang, Linfeng Zhao, Haojie Huang, Yu Qi, Robert Platt
2024-07-16
Abstract: Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that uses GPT-4o's advanced contextual reasoning to plan grasping strategies in heavily cluttered environments. ThinkGrasp can effectively identify and generate grasp poses for target objects, even when they are heavily obstructed or nearly invisible, by using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it in a few steps and with a high success rate. In both simulated and real experiments, ThinkGrasp achieved a high success rate and significantly outperformed state-of-the-art methods in heavily cluttered environments or with diverse unseen objects, demonstrating strong generalization capabilities.
Robotics
What problem does this paper attempt to address?
The paper addresses the difficulty robots have in grasping target objects in cluttered environments. Specifically, existing methods struggle to identify and grasp objects that are heavily occluded or nearly invisible beneath other items. To solve this problem, the authors propose ThinkGrasp, a plug-and-play vision-language grasping system built on a large pre-trained vision-language model (GPT-4o). ThinkGrasp achieves its goals through the following means:

1. **Advanced reasoning capabilities**: Utilizing the contextual reasoning abilities of GPT-4o to understand and segment objects in the environment.
2. **Goal-oriented language guidance**: Using natural language instructions to guide the removal of obstructing objects, thereby gradually exposing the target object and ultimately grasping it with a high success rate.
3. **Modular design**: The system adopts a modular design, making it easy to integrate into various robotic platforms and quickly adapt to new language goals and new objects.

Experimental results show that ThinkGrasp significantly outperforms existing methods in both simulated and real environments, reaching the target in few steps and with a high success rate. It demonstrates strong generalization, particularly in heavily cluttered environments and in scenarios with unseen objects.
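
The progressive uncovering strategy described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Scene` class is a toy occlusion model, and `choose_next_grasp` is a hypothetical stand-in for the decision that ThinkGrasp delegates to GPT-4o. All names and the `max_steps` parameter are assumptions made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Scene:
    """Toy clutter model (an assumption): each object maps to the objects covering it."""

    on_top_of: dict[str, set[str]] = field(default_factory=dict)

    def objects(self) -> set[str]:
        # Objects that are still present in the scene.
        return set(self.on_top_of)

    def blockers(self, name: str) -> set[str]:
        # Only objects that are still present can obstruct a grasp.
        return self.on_top_of[name] & self.objects()

    def remove(self, name: str) -> None:
        del self.on_top_of[name]


def choose_next_grasp(scene: Scene, target: str) -> str:
    """Hypothetical stand-in for the vision-language model's choice.

    Follows the chain of blockers upward from the target until it reaches
    an object with nothing on top of it. Assumes the occlusion relation
    has no cycles.
    """
    current = target
    while scene.blockers(current):
        # Sorting makes the toy policy deterministic.
        current = sorted(scene.blockers(current))[0]
    return current


def grasp_target(scene: Scene, target: str, max_steps: int = 10) -> list[str]:
    """Remove obstructing objects one at a time until the target is grasped."""
    sequence: list[str] = []
    for _ in range(max_steps):
        pick = choose_next_grasp(scene, target)
        scene.remove(pick)
        sequence.append(pick)
        if pick == target:
            return sequence
    raise RuntimeError(f"could not reach {target!r} within {max_steps} steps")


if __name__ == "__main__":
    clutter = Scene({
        "mug": {"box", "towel"},  # the mug is covered by a box and a towel
        "box": {"towel"},         # the towel also lies on the box
        "towel": set(),
        "pen": set(),             # unrelated object, should be left alone
    })
    print(grasp_target(clutter, "mug"))
```

In this toy scene the loop clears the towel, then the box, and only then grasps the mug, leaving the unrelated pen untouched. In the real system the choice of what to grasp next comes from the vision-language model reasoning over camera images and the language goal, not from a known occlusion graph.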