Abstract: In future home-assistant robots, manipulating 3D articulated objects is essential for interacting with the environment. Many existing studies use 3D point clouds as the primary input for manipulation policies. However, this approach is challenged by data sparsity and the significant cost of acquiring point cloud data, which can limit its practicality. In contrast, RGB images offer high-resolution observations from cost-effective devices but lack 3D geometric information. To overcome these limitations, we present a novel image-based robotic manipulation framework that captures multiple perspectives of the target object and infers depth information to complement its geometry. Initially, the system employs an eye-on-hand RGB camera to capture an overall view of the target object, from which it predicts an initial depth map and a coarse affordance map. The affordance map indicates actionable areas on the object and serves as a constraint for selecting subsequent viewpoints. Based on this global visual prior, we adaptively identify the optimal next viewpoint for a detailed observation of the area where manipulation is likely to succeed. We leverage geometric consistency to fuse the views, resulting in a refined depth map and a more precise affordance map for robot manipulation decisions. By comparing with prior works that adopt point clouds or RGB images as inputs, we demonstrate the effectiveness and practicality of our method. On the project webpage (https://sites.google.com/view/imagemanip), real-world experiments further highlight the potential of our method for practical deployment.
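
The affordance-guided viewpoint step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the next viewpoint is chosen, so the sketch assumes a standard pinhole camera model and a simple "move closer along the viewing ray" rule. The function names (`peak_pixel`, `back_project`, `closer_viewpoint`) and the parameters `fx`, `fy`, `cx`, `cy`, and `standoff` are hypothetical and introduced only for illustration.

```python
import math


def peak_pixel(affordance):
    """Return (row, col) of the highest-scoring pixel in a 2D affordance map."""
    best = (0, 0)
    for r, row in enumerate(affordance):
        for c, score in enumerate(row):
            if score > affordance[best[0]][best[1]]:
                best = (r, c)
    return best


def back_project(row, col, depth, fx, fy, cx, cy):
    """Lift a pixel to a 3D point in the camera frame (pinhole model).

    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    These intrinsics are assumed inputs, not values from the paper.
    """
    z = depth[row][col]
    x = (col - cx) * z / fx
    y = (row - cy) * z / fy
    return (x, y, z)


def closer_viewpoint(point, standoff):
    """Assumed heuristic: place the next camera position on the ray from the
    current camera origin to `point`, stopping `standoff` units short of it."""
    dist = math.sqrt(sum(v * v for v in point))
    if dist <= standoff:
        # Already within the standoff distance, so stay at the current origin.
        return (0.0, 0.0, 0.0)
    scale = (dist - standoff) / dist
    return tuple(v * scale for v in point)


# Toy 3x3 maps: the centre pixel is the most actionable and lies 2.0 units away.
affordance = [[0.1, 0.2, 0.1], [0.1, 0.9, 0.3], [0.0, 0.2, 0.1]]
depth = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]

row, col = peak_pixel(affordance)
target = back_project(row, col, depth, fx=100.0, fy=100.0, cx=1.0, cy=1.0)
next_position = closer_viewpoint(target, standoff=0.5)
```

In this sketch the coarse affordance map only selects where to look and the predicted depth supplies the missing 3D coordinate. The framework described above additionally fuses the two views using geometric consistency, which is not modeled here.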

Understanding 3D Object Interaction from a Single Image

ImageManip: Image-based Robotic Manipulation with Affordance-guided Next View Selection

IFR-Explore: Learning Inter-object Functional Relationships in 3D Indoor Scenes

Detecting and Recognizing Human-Object Interactions

Latent Space Planning for Multi-Object Manipulation with Environment-Aware Relational Classifiers

Multi-modal Interaction with Transformers: Bridging Robots and Human with Natural Language

Subjects and Their Objects: Localizing Interactees for a Person-Centric View of Importance

EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

Modeling 4D Human-Object Interactions for Event and Object Recognition

Latent Space Planning for Multiobject Manipulation With Environment-Aware Relational Classifiers

Parallel disentangling network for human–object interaction detection

ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images

Localization and Completion for 3D Object Interactions

Understanding Contexts Inside Robot and Human Manipulation Tasks through a Vision-Language Model and Ontology System in a Video Stream

EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views

LEMON: Learning 3D Human-Object Interaction Relation from 2D Images

Vision-based Manipulation from Single Human Video with Open-World Object Graphs

HOIG: End-to-End Human-Object Interactions Grounding with Transformers

Modeling 4D Human-Object Interactions for Joint Event Segmentation, Recognition, and Object Localization

InterTracker: Discovering and Tracking General Objects Interacting with Hands in the Wild