Abstract: Most humanoid social robots currently in use are designed only for verbal and animated interaction with users; although they are equipped with two arms for interactive animation, they lack object manipulation capabilities. In this paper, we propose the MONOCULAR (eMbeddable autONomous ObjeCt manipULAtion Routines) framework, which implements a set of routines that add manipulation functionality to social robots by fusing data from two RGB cameras and a 3D depth sensor placed in the head frame. The framework is designed to: (i) localize specific objects to be manipulated via the RGB cameras; (ii) define the characteristics of the shelf on which they are placed; and (iii) autonomously adapt approach and manipulation routines to avoid collisions and maximize grabbing accuracy. To localize the item on the shelf, MONOCULAR exploits an embeddable version of the You Only Look Once (YOLO) object detector. The RGB camera output is also used to estimate the height of the shelf with an edge-detection algorithm. Based on the item's position and the estimated shelf height, MONOCULAR selects between two routines, one optimized for a central approach and one for a lateral approach to objects on a shelf. Each routine dynamically optimizes the approach and object manipulation parameters according to real-time analysis of the RGB and 3D sensor frames. The MONOCULAR procedures are designed to be fully automatic, intrinsically protecting users' sensitive data and stored home or hospital maps. MONOCULAR was optimized for Pepper by SoftBank Robotics. To characterize the proposed system, we present a case study in which Pepper is used as a drug delivery operator. The case study is divided into: (i) pharmaceutical package search; (ii) object approach and manipulation; and (iii) delivery operations.
Experimental data showed that the object manipulation routine for laterally placed objects achieves a best grabbing success rate of 96%, while the routine for centrally placed objects reaches 97% over a wide range of shelf heights. Finally, a proof of concept is presented to demonstrate the applicability of the MONOCULAR framework in a real-life scenario.
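
The choice between the central and lateral routines described above can be illustrated with a minimal sketch. It is not the MONOCULAR implementation: the abstract does not state the decision rule, so the `Detection` type, the `select_routine` function, the `central_fraction` threshold, and the reachable height range are all hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one detector output (not a MONOCULAR type)."""
    x_center: float   # horizontal center of the bounding box, in pixels
    image_width: int  # width of the RGB frame, in pixels


def select_routine(det: Detection,
                   shelf_height_m: float,
                   central_fraction: float = 0.4,
                   min_height_m: float = 0.6,
                   max_height_m: float = 1.0) -> str:
    """Pick an approach routine from item position and shelf height.

    All thresholds are assumptions made for this sketch; the source
    does not report the values or the rule used by the framework.
    """
    # Reject shelves outside the assumed reachable height range.
    if not (min_height_m <= shelf_height_m <= max_height_m):
        return "unreachable"

    # Normalized horizontal offset from the image center:
    # 0.0 at the center, 1.0 at the left or right image border.
    half_width = det.image_width / 2
    offset = abs(det.x_center - half_width) / half_width

    return "central" if offset <= central_fraction else "lateral"


if __name__ == "__main__":
    print(select_routine(Detection(x_center=320, image_width=640), 0.8))
    print(select_routine(Detection(x_center=600, image_width=640), 0.8))
```

In this sketch the item's horizontal offset decides the routine, and the estimated shelf height only gates reachability; a real system could equally use the height to tune arm parameters within each routine.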

An Object Attribute Guided Framework for Robot Learning Manipulations from Human Demonstration Videos

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

A Multi-modal Framework for Robots to Learn Manipulation Tasks from Human Demonstrations

Vision-based Robot Manipulation Learning via Human Demonstrations

Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations

A Human–Robot Collaboration Method Using a Pose Estimation Network for Robot Learning of Assembly Manipulation Trajectories From Demonstration Videos

Learning Generalizable 3D Manipulation With 10 Demonstrations

Learning Actions from Human Demonstration Video for Robotic Manipulation

Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Learning Multi-Step Manipulation Tasks from A Single Human Demonstration

Learning Human-to-Robot Dexterous Handovers for Anthropomorphic Hand

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

Skill Learning Framework for Human-Robot Interaction and Manipulation Tasks

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

Learning Robotic Manipulation through Visual Planning and Acting

Vision-based Manipulation from Single Human Video with Open-World Object Graphs

An Embedded Framework for Fully Autonomous Object Manipulation in Robotic-Empowered Assisted Living

Semantic learning from keyframe demonstration using object attribute constraints

Learning Manipulation by Predicting Interaction

Deep Learning-Based Ensemble Approach for Autonomous Object Manipulation with an Anthropomorphic Soft Robot Hand

DexMV: Imitation Learning for Dexterous Manipulation from Human Videos