Abstract: Language-Conditioned Robotic Grasping (LCRG) aims to develop robots that comprehend and grasp objects based on natural language instructions. While the ability to understand personal objects like "my wallet" facilitates more natural interaction with human users, current LCRG systems only allow generic language instructions, e.g., "the black-colored wallet next to the laptop". To this end, we introduce a task scenario, GraspMine, alongside a novel dataset aimed at pinpointing and grasping personal objects given personal indicators via learning from a single human-robot interaction, rather than a large labeled dataset. Our proposed method, Personalized Grasping Agent (PGA), addresses GraspMine by leveraging the unlabeled image data of the user's environment, called Reminiscence. Specifically, PGA acquires personal object information by a user presenting a personal object with its associated indicator, followed by PGA inspecting the object by rotating it. Based on the acquired information, PGA pseudo-labels objects in the Reminiscence using our proposed label propagation algorithm. Harnessing the information acquired from the interactions and the pseudo-labeled objects in the Reminiscence, PGA adapts the object grounding model to grasp personal objects. This makes PGA far more data-efficient than previous LCRG systems, which rely on resource-intensive human annotations and need hundreds of labeled samples to learn "my wallet". Moreover, PGA outperforms baseline methods across all metrics and even performs comparably to the fully supervised method, which learns from 9k annotated data samples. We further validate PGA's real-world applicability by employing a physical robot to execute GraspMine. Code and data are publicly available at https://github.com/JHKim-snu/PGA.

What problem does this paper attempt to address?

The problem this paper attempts to address is that current Language-Conditioned Robotic Grasping (LCRG) systems rely on generic language instructions to describe and manipulate objects, which makes human-robot interaction unintuitive. For example, a user might instinctively say "grab my wallet," but existing LCRG systems might require a more specific instruction, such as "grab the black wallet next to the laptop." This discrepancy forces users to adjust their instructions to fit the robot's knowledge base, resulting in "robot-centric" instructions that non-expert users may find unfamiliar and cumbersome.

To solve this issue, the paper introduces a new personalized task scenario called GraspMine, along with a benchmark dataset. GraspMine aims to locate and grasp personal items referred to by personal indicators (e.g., "my sleeping pills"), which systems built on general knowledge cannot resolve. GraspMine requires the robot to learn personal items through minimal human-robot interaction (i.e., a single verbal introduction of the personal item), enabling more intuitive, user-centric interaction.

Specifically, the paper proposes a method called Personalized Grasping Agent (PGA) that achieves this goal through the following steps:

1. **Constructing a Reminiscence Memory Bank**: Collecting a series of raw, unlabeled images from the user's environment.
2. **Acquiring Object Information**: Obtaining information about personal items through two consecutive steps (human-robot interaction and robot-object interaction).
   - **Human-Robot Interaction**: The user shows the personal item to the robot and verbally describes it.
   - **Robot-Object Interaction**: The robot examines the item from multiple angles, obtaining multi-view images of it.
3. **Propagating Labels through the Memory Bank**: Using the acquired personal item information to propagate personal indicators to unlabeled objects in the memory bank based on their visual features.
4. **Adapting the Object Grounding Model**: Training the object grounding model on the data obtained during the interactions and on the pseudo-labeled objects, enabling it to accurately locate and grasp the personal items the user specifies.

Experimental results show that PGA outperforms baseline methods across all metrics and performs comparably to a fully supervised method that requires nearly 9,000 labeled samples. Additionally, PGA demonstrates practical applicability in the real world by performing the GraspMine task with a physical robot.
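
To make the label-propagation step (step 3) more concrete, here is a minimal Python sketch. It is not the paper's algorithm, whose details are not given here. It assumes a simple nearest-view rule: each unlabeled object feature in the memory bank receives the personal indicator of the most similar multi-view feature, provided the cosine similarity reaches a threshold. The function names, the toy 2-D feature vectors, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def propagate_labels(
    labeled_views: Dict[str, List[Vector]],
    memory_features: List[Vector],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> List[Optional[str]]:
    """Assign each memory-bank feature the indicator of its most similar view.

    labeled_views maps a personal indicator (e.g. "my wallet") to the
    features of the multi-view images collected for that object.
    Features whose best similarity is below `threshold` stay unlabeled (None).
    """
    pseudo_labels: List[Optional[str]] = []
    for feature in memory_features:
        best_label: Optional[str] = None
        best_score = threshold
        for indicator, views in labeled_views.items():
            # Best match over all views of this personal object.
            score = max(
                (cosine_similarity(feature, view) for view in views),
                default=-1.0,
            )
            if score >= best_score:
                best_label, best_score = indicator, score
        pseudo_labels.append(best_label)
    return pseudo_labels


# Toy example with made-up 2-D features.
views = {
    "my wallet": [[1.0, 0.0], [0.9, 0.1]],
    "my sleeping pills": [[0.0, 1.0]],
}
memory_bank = [[0.95, 0.05], [0.1, 0.9], [0.7, 0.7]]
print(propagate_labels(views, memory_bank))
```

In this toy run the first two memory-bank features are close to a known view and receive an indicator, while the ambiguous third feature stays unlabeled. A real system would obtain the features from a visual encoder and would likely need a more robust propagation rule than a single fixed threshold.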

PGA: Personalizing Grasping Agents with Single Human-Robot Interaction

LiteGrasp: A Light Robotic Grasp Detection Via Semi-Supervised Knowledge Distillation

GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy

PROGrasp: Pragmatic Human-Robot Communication for Object Grasping

Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding

A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

GraspGF: Learning Score-based Grasping Primitive for Human-assisting Dexterous Grasping

PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models

Intelligent Grasping with Natural Human-Robot Interaction

Robot Instance Segmentation with Few Annotations for Grasping

A Grasp Pose is All You Need: Learning Multi-fingered Grasping with Deep Reinforcement Learning from Vision and Touch

VL-Grasp: a 6-Dof Interactive Grasp Policy for Language-Oriented Objects in Cluttered Indoor Scenes

ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

MVGrasp: Real-time multi-view 3D object grasping in highly cluttered environments

On-Policy Pixel-Level Grasping Across the Gap Between Simulation and Reality

Deep Learning Method for Grasping Novel Objects Using Dexterous Hands

Grasp as You Say: Language-guided Dexterous Grasp Generation

GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping

Implementation and Optimization of Grasping Learning with Dual-modal Soft Gripper

On Automated Object Grasping for Intelligent Prosthetic Hands Using Machine Learning

Learning Robust Real-World Dexterous Grasping Policies via Implicit Shape Augmentation