Abstract: For future home-assistant robots, manipulating 3D articulated objects is essential for interacting with the environment. Many existing studies use 3D point clouds as the primary input for manipulation policies. However, point clouds are sparse and costly to acquire, which limits the practicality of this approach. In contrast, RGB images offer high-resolution observations from cost-effective devices but lack 3D geometric information. To overcome these limitations, we present a novel image-based robotic manipulation framework that captures multiple perspectives of the target object and infers depth information to complement its geometry. The system first uses an eye-on-hand RGB camera to capture an overall view of the target object, from which it predicts an initial depth map and a coarse affordance map. The affordance map indicates actionable areas on the object and serves as a constraint for selecting subsequent viewpoints. Based on this global visual prior, we adaptively identify the optimal next viewpoint for a detailed observation of the region where manipulation is most likely to succeed. We leverage geometric consistency to fuse the views, resulting in a refined depth map and a more precise affordance map for robot manipulation decisions. By comparing with prior works that adopt point clouds or RGB images as inputs, we demonstrate the effectiveness and practicality of our method. On the project webpage (https://sites.google.com/view/imagemanip), real-world experiments further highlight the potential of our method for practical deployment.
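
The affordance-guided view selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the "move along the viewing ray to a fixed standoff distance" heuristic are all assumptions made for illustration. It only shows how a coarse affordance map and a predicted depth map could jointly pick a 3D target and a closer camera position.

```python
import math


def peak_pixel(affordance):
    """Return (row, col) of the highest-scoring pixel in a 2D affordance map."""
    best = max(
        ((score, r, c)
         for r, row in enumerate(affordance)
         for c, score in enumerate(row)),
        key=lambda t: t[0],
    )
    return best[1], best[2]


def back_project(row, col, depth, fx, fy, cx, cy):
    """Pinhole back-projection of a pixel with known depth into the camera frame."""
    x = (col - cx) * depth / fx
    y = (row - cy) * depth / fy
    return (x, y, depth)


def next_viewpoint(affordance, depth_map, fx, fy, cx, cy, standoff):
    """Hypothetical heuristic (an assumption, not from the paper): slide the
    camera along the viewing ray until it sits `standoff` metres away from
    the peak-affordance point. Assumes the predicted depth is positive."""
    r, c = peak_pixel(affordance)
    target = back_project(r, c, depth_map[r][c], fx, fy, cx, cy)
    dist = math.sqrt(sum(v * v for v in target))
    scale = (dist - standoff) / dist
    camera_pos = tuple(v * scale for v in target)
    return camera_pos, target


if __name__ == "__main__":
    # Toy 3x3 maps; the values are made up for illustration only.
    affordance = [[0.1, 0.2, 0.1],
                  [0.1, 0.9, 0.3],
                  [0.0, 0.1, 0.2]]
    depth_map = [[1.0] * 3 for _ in range(3)]
    print(next_viewpoint(affordance, depth_map, 1.0, 1.0, 1.0, 1.0, 0.25))
```

A real system would also need to choose the camera orientation, check reachability, and fuse the two views; those steps are outside this sketch.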

Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map

Real-world Instance-specific Image Goal Navigation for Service Robots: Bridging the Domain Gap with Contrastive Learning

Open-set 3D semantic instance maps for vision language navigation – O3D-SIM

Object-aware Semantic Mapping of Indoor Scenes Using Octomap

Real-time 3D Semantic Scene Perception for Egocentric Robots with Binocular Vision

3D Semantic MapNet: Building Maps for Multi-Object Re-Identification in 3D

Preferential Multi-Target Search in Indoor Environments using Semantic SLAM

Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Navigating to Objects Specified by Images

Object-Oriented 3D Semantic Mapping Based on Instance Segmentation

Volumetric Instance-Aware Semantic Mapping and 3D Object Discovery

MVGrasp: Real-time multi-view 3D object grasping in highly cluttered environments

3D attention-driven depth acquisition for object identification

3D Move to See: Multi-perspective visual servoing for improving object views with semantic segmentation

Interactive Semantic Map Representation for Skill-based Visual Object Navigation

ImageManip: Image-based Robotic Manipulation with Affordance-guided Next View Selection

MORE: Simultaneous Multi-View 3D Object Recognition and Pose Estimation

Visual-Inertial Multi-Instance Dynamic SLAM with Object-level Relocalisation

Efficient Multi-Object Detection and Smart Navigation Using Artificial Intelligence for Visually Impaired People

Active Robot Vision for Distant Object Change Detection: A Lightweight Training Simulator Inspired by Multi-Armed Bandits