PGA: Personalizing Grasping Agents with Single Human-Robot Interaction

Junghyun Kim, Gi-Cheon Kang, Jaein Kim, Seoyun Yang, Minjoon Jung, Byoung-Tak Zhang
2024-03-19
Abstract: Language-Conditioned Robotic Grasping (LCRG) aims to develop robots that comprehend and grasp objects based on natural language instructions. While the ability to understand personal objects like "my wallet" facilitates more natural interaction with human users, current LCRG systems only allow generic language instructions, e.g., "the black-colored wallet next to the laptop". To this end, we introduce a task scenario, GraspMine, alongside a novel dataset aimed at pinpointing and grasping personal objects given personal indicators by learning from a single human-robot interaction rather than a large labeled dataset. Our proposed method, Personalized Grasping Agent (PGA), addresses GraspMine by leveraging unlabeled image data of the user's environment, called Reminiscence. Specifically, PGA acquires personal object information when a user presents a personal object with its associated indicator, after which PGA inspects the object by rotating it. Based on the acquired information, PGA pseudo-labels objects in the Reminiscence with our proposed label propagation algorithm. Harnessing the information acquired from the interactions and the pseudo-labeled objects in the Reminiscence, PGA adapts the object grounding model to grasp personal objects. This makes PGA considerably more efficient than previous LCRG systems, which rely on resource-intensive human annotations and need hundreds of labeled samples to learn "my wallet". Moreover, PGA outperforms baseline methods across all metrics and even performs comparably to the fully-supervised method, which learns from 9k annotated data samples. We further validate PGA's real-world applicability by employing a physical robot to execute GraspMine. Code and data are publicly available at https://github.com/JHKim-snu/PGA.
Robotics, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is that current Language-Conditioned Robotic Grasping (LCRG) systems rely on generic language instructions to describe and manipulate objects, which makes human-robot interaction unintuitive. For example, a user might instinctively say "grab my wallet," but existing LCRG systems might require a more specific instruction, such as "grab the black wallet next to the laptop." This discrepancy forces users to adjust their instructions to fit the robot's knowledge base, resulting in "robot-centric" instructions that non-expert users may find unfamiliar and cumbersome.

To solve this issue, the paper introduces a new personalized task scenario called GraspMine, along with a benchmark dataset. GraspMine aims to locate and grasp personal items given personal referential terms (e.g., "my sleeping pills"), which systems with only general knowledge cannot resolve. GraspMine requires the robot to learn personal items through minimal human-robot interaction (i.e., a single verbal introduction of the personal item), enabling more intuitive, user-centric interaction.

Specifically, the paper proposes a method called Personalized Grasping Agent (PGA), which works in the following steps:

1. **Constructing a Reminiscence Memory Bank**: Collecting a set of raw, unlabeled images from the user's environment.
2. **Acquiring Object Information**: Obtaining information about personal items through two consecutive steps (human-robot interaction and robot-object interaction).
   - **Human-Robot Interaction**: The user shows the personal item to the robot and verbally introduces it with its personal indicator.
   - **Robot-Object Interaction**: The robot rotates the item to inspect it from multiple angles, obtaining multi-view images of it.
3. **Propagating Labels through the Memory Bank**: Using the acquired personal item information to propagate personal referential terms to unlabeled objects in the memory bank based on their visual features.
4. **Adapting the Object Grounding Model**: Training the object grounding model on the data obtained during the interactions and on the pseudo-labeled objects, enabling it to locate and grasp the personal items the user specifies.

Experimental results show that PGA outperforms baseline methods across all metrics and even performs comparably to a fully supervised method trained on 9k annotated samples. Additionally, PGA demonstrates practical applicability in the real world by performing the GraspMine task with a physical robot.
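
The label propagation step (step 3) can be illustrated with a small sketch. This is not the paper's algorithm, whose details are not given here; it is a plain nearest-neighbor stand-in that assigns each unlabeled object the indicator of its most similar multi-view image, provided the cosine similarity clears a threshold. The function names, the toy 3-dimensional feature vectors, and the `threshold` value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate_labels(labeled, unlabeled, threshold=0.8):
    """Pseudo-label unlabeled objects (hypothetical nearest-neighbor rule).

    labeled:   dict mapping a personal indicator to the feature vectors of
               its multi-view images from the robot-object interaction.
    unlabeled: list of feature vectors for objects in the memory bank.
    Returns one indicator (or None if nothing is similar enough) per object.
    """
    pseudo = []
    for feat in unlabeled:
        best_label, best_sim = None, threshold
        for indicator, views in labeled.items():
            # Compare against the closest view of this personal object.
            sim = max(cosine(feat, v) for v in views)
            if sim >= best_sim:
                best_label, best_sim = indicator, sim
        pseudo.append(best_label)
    return pseudo


# Toy features; a real system would use embeddings from a visual encoder.
labeled = {
    "my wallet": [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]],
    "my sleeping pills": [[0.0, 1.0, 0.0]],
}
memory_bank = [[0.95, 0.05, 0.05], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]

pseudo_labels = propagate_labels(labeled, memory_bank)
print(pseudo_labels)  # ['my wallet', 'my sleeping pills', None]
```

The third object stays unlabeled because it resembles neither personal item. Leaving low-similarity objects out, rather than forcing a label onto every object, is one simple way to limit noisy pseudo-labels before the grounding model is adapted on them.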