Abstract: Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human $\textbf{H}$and$\textbf{-In}$formed visual representation learning framework to solve difficult $\textbf{Dex}$terous manipulation tasks ($\textbf{H-InDex}$) with reinforcement learning. Our framework consists of three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) offline adaptation of representations with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm. The last two stages modify only $0.36\%$ of the parameters of the pre-trained representation in total, ensuring the knowledge from pre-training is maintained to the full extent. We empirically study 12 challenging dexterous manipulation tasks and find that H-InDex largely surpasses strong baseline methods and the recent visual foundation models for motor control. Code is available at https://yanjieze.com/H-InDex.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the efficiency issue of multi-fingered robotic hands in performing complex dexterous manipulation tasks. Specifically, the paper proposes a method called H-InDex (Hand-Informed Visual Reinforcement Learning Framework) that enhances the dexterous manipulation capabilities of robots by leveraging visual representations of human hands. The main objectives of the paper are:

1. **Improve Sample Efficiency**: Enable robots to learn complex dexterous manipulation tasks more efficiently with a limited number of interactions.
2. **Leverage Human Hand Priors**: Transfer the dexterity of human hands to robotic hands' operations through a pre-trained 3D hand pose estimation model.
3. **Adapt to Different Tasks**: Validate the effectiveness of the method across various dexterous manipulation tasks, including hammering, door handling, writing, pouring water, and placing objects.

### Method Overview

The H-InDex framework consists of three stages:

1. **Representation Pre-training**: Pre-train visual representations using a 3D hand pose estimation task to enable the model to understand the dexterity of human hands.
2. **Representation Offline Adaptation**: Fine-tune only the affine transformation parameters of the BatchNorm layers (approximately 0.18% of the total parameters) in the pre-trained model through a self-supervised keypoint detection task to adapt to the morphological and structural differences of robotic hands.
3. **Reinforcement Learning**: During the reinforcement learning stage, freeze the visual representations and use Exponential Moving Average (EMA) to update the mean and variance of the BatchNorm layers to adapt to the changing observation distribution.

### Experimental Results

The paper conducted experiments on 12 challenging dexterous manipulation tasks and compared the results with several strong baseline models (such as VC-1, MVP, R3M, and RRL). The results show that H-InDex significantly outperforms these baseline models in most tasks, particularly excelling in sample efficiency.

### Main Contributions

1. **Proposed a New Visual Reinforcement Learning Framework**: H-InDex effectively enhances the dexterous manipulation capabilities of robots by utilizing rich hand information.
2. **Validated Effectiveness Across Multiple Challenging Tasks**: Demonstrated the superior performance of H-InDex in 12 dexterous manipulation tasks through experiments.
3. **Provided Valuable Insights**: Explored the direct application of pre-trained models in dexterous manipulation tasks, particularly the application of 3D hand pose estimation models.

Through these contributions, the paper offers new ideas and methods for research in the field of robotic dexterous manipulation.
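The BatchNorm handling described in the second and third stages of the method can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `EMABatchNorm1d`, the `momentum` value, and the per-feature (1D) layout are assumptions made for illustration, and pure Python is used so the example stays self-contained. It shows the split the summary describes: affine parameters (`gamma`, `beta`) are what the offline adaptation stage would tune, while the reinforcement learning stage only moves the running mean and variance with an exponential moving average.

```python
from statistics import fmean, pvariance


class EMABatchNorm1d:
    """Per-feature BatchNorm whose running statistics follow an EMA (illustrative)."""

    def __init__(self, num_features, momentum=0.01, eps=1e-5):
        self.momentum = momentum  # hypothetical value, not taken from the paper
        self.eps = eps
        # Affine parameters: tuned during offline adaptation, frozen afterwards.
        self.gamma = [1.0] * num_features
        self.beta = [0.0] * num_features
        # Running statistics: the only values updated during reinforcement learning.
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features

    def update_stats(self, batch):
        """Move the running mean/variance toward the statistics of `batch`."""
        m = self.momentum
        for j, column in enumerate(zip(*batch)):
            self.running_mean[j] = (1 - m) * self.running_mean[j] + m * fmean(column)
            self.running_var[j] = (1 - m) * self.running_var[j] + m * pvariance(column)

    def __call__(self, x):
        """Normalize one feature vector using the running statistics."""
        return [
            g * (v - mu) / (var + self.eps) ** 0.5 + b
            for v, mu, var, g, b in zip(
                x, self.running_mean, self.running_var, self.gamma, self.beta
            )
        ]
```

In use, `update_stats` would be called on each new batch of observations so that normalization tracks the shifting observation distribution, while `gamma`, `beta`, and every other weight of the representation stay fixed.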

H-InDex: Visual Reinforcement Learning with Hand-Informed Representations for Dexterous Manipulation

DexRepNet: Learning Dexterous Robotic Grasping Network with Geometric and Spatial Hand-Object Representations

DexH2R: Task-oriented Dexterous Manipulation from Human to Robots

RealDex: Towards Human-like Grasping for Robotic Dexterous Hand

DexDeform: Dexterous Deformable Object Manipulation with Human Demonstrations and Differentiable Physics

DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

MoDex: Planning High-Dimensional Dexterous Control via Learning Neural Hand Models

Holo-Dex: Teaching Dexterity with Immersive Mixed Reality

Towards Human-Level Bimanual Dexterous Manipulation with Reinforcement Learning

DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video

Bi-DexHands: Towards Human-Level Bimanual Dexterous Manipulation

Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost

Dexterous Manipulation from Images: Autonomous Real-World RL via Substep Guidance

Physics-Based Dexterous Manipulations with Estimated Hand Poses and Residual Reinforcement Learning

Dexterity from Touch: Self-Supervised Pre-Training of Tactile Representations with Robotic Play

Dext-Gen: Dexterous Grasping in Sparse Reward Environments with Full Orientation Control

A High-Efficient Reinforcement Learning Approach for Dexterous Manipulation

DexPoint: Generalizable Point Cloud Reinforcement Learning for Sim-to-Real Dexterous Manipulation