Abstract: For future home-assistant robots, manipulating 3D articulated objects is essential to interacting with the environment. Many existing studies use 3D point clouds as the primary input for manipulation policies. However, point clouds are sparse and costly to acquire, which limits the practicality of this approach. In contrast, RGB images offer high-resolution observations from cost-effective devices but lack 3D geometric information. To overcome these limitations, we present a novel image-based robotic manipulation framework. The framework captures multiple perspectives of the target object and infers depth information to complement its geometry. Initially, the system employs an eye-on-hand RGB camera to capture an overall view of the target object, from which it predicts an initial depth map and a coarse affordance map. The affordance map indicates actionable areas on the object and serves as a constraint for selecting subsequent viewpoints. Based on this global visual prior, we adaptively identify the optimal next viewpoint for a detailed observation of the region where manipulation is likely to succeed. We leverage geometric consistency to fuse the views, resulting in a refined depth map and a more precise affordance map for robot manipulation decisions. By comparing with prior works that adopt point clouds or RGB images as inputs, we demonstrate the effectiveness and practicality of our method. On the project webpage (https://sites.google.com/view/imagemanip), real-world experiments further highlight the potential of our method for practical deployment.
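The affordance-guided viewpoint step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the selection criterion, so the rule used here (choose the candidate camera position nearest to the 3D point of highest predicted affordance) is an assumption, as are the pinhole camera model and the hypothetical names `backproject`, `select_next_view`, `K`, and `candidates`.

```python
import numpy as np


def backproject(u, v, depth, K):
    """Lift pixel (u, v) to a 3D point in the camera frame (pinhole model, assumed)."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    z = depth[v, u]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])


def select_next_view(affordance, depth, K, candidates):
    """Return the index of the candidate camera position nearest to the
    highest-affordance 3D point, together with that point.

    affordance: (H, W) coarse affordance map from the first view
    depth:      (H, W) predicted depth map from the first view
    K:          (3, 3) camera intrinsics
    candidates: (N, 3) candidate camera positions in the first camera's frame
    """
    # Pixel with the highest predicted affordance (row = v, column = u).
    v, u = np.unravel_index(np.argmax(affordance), affordance.shape)
    target = backproject(u, v, depth, K)
    # Illustrative criterion only: prefer the closest candidate viewpoint.
    dists = np.linalg.norm(candidates - target, axis=1)
    return int(np.argmin(dists)), target
```

A real system would also need to check reachability and visibility of each candidate and then fuse the two views; those steps are omitted here.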

Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions

Learning Object Affordance with Contact and Grasp Generation

AdaAfford: Learning to Adapt Manipulation Affordance for 3D Articulated Objects via Few-shot Interactions

O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning

PartAfford: Part-level Affordance Discovery from 3D Objects

Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

DualAfford: Learning Collaborative Visual Affordance for Dual-gripper Manipulation

Language-Conditioned Affordance-Pose Detection in 3D Point Clouds

Where2Explore: Few-shot Affordance Learning for Unseen Novel Categories of Articulated Objects

Articulated Object Manipulation with Coarse-to-fine Affordance for Mitigating the Effect of Point Cloud Noise

RLAfford: End-to-End Affordance Learning for Robotic Manipulation

Manipulation-Oriented Object Perception in Clutter through Affordance Coordinate Frames

Self-Supervised Learning of Action Affordances as Interaction Modes

Learning Foresightful Dense Visual Affordance for Deformable Object Manipulation

Visual-Geometric Collaborative Guidance for Affordance Learning

MAAL: Multimodality-Aware Autoencoder-Based Affordance Learning for 3D Articulated Objects

ImageManip: Image-based Robotic Manipulation with Affordance-guided Next View Selection

VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects

Learning Interactive Affordance for Human-Robot Interaction

3D-TAFS: A Training-free Framework for 3D Affordance Segmentation

UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models