Abstract: Deep reinforcement learning (DRL) has proven to be a powerful paradigm for learning complex control policies autonomously. Numerous recent applications of DRL in robotic grasping have successfully trained DRL robotic agents end-to-end, mapping visual inputs directly into control instructions, but the amount of training data required may hinder these applications in practice. In this paper, we propose a DRL-based robotic visual grasping framework in which visual perception and control policy are trained separately rather than end-to-end. The visual perception module produces physical descriptions of the objects to be grasped, and the policy makes use of them to decide optimal actions based on DRL. Benefiting from the explicit representation of objects, the policy is expected to generalize better to new objects and environments. In addition, the policy can be trained in simulation and transferred to a real robotic system without any further training. We evaluate our framework in a real-world robotic system on a number of robotic grasping tasks, such as semantic grasping, clustered object grasping, and moving object grasping. The results show impressive robustness and generalization of our system.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is **improving the generalization ability and data efficiency of deep reinforcement learning in robotic grasping tasks**. Specifically, the authors point out that although deep reinforcement learning (DRL) has made remarkable progress in robotic grasping, these methods usually require a large amount of training data, which is time-consuming and resource-intensive to collect in practice. In addition, most existing DRL methods generalize poorly to new objects or environments, which limits their use in the real world.

To address these problems, the paper proposes a framework that reduces the amount of data required and improves generalization by training visual perception and the control policy separately. The specific measures are:

1. **Separate training of visual perception and control policy**: Traditional end-to-end training needs a large amount of data to optimize visual feature extraction and the control policy simultaneously. In the proposed framework, the visual perception module (based on Mask R-CNN) and the control policy module (based on PPO) are trained separately. The visual perception module extracts the semantic and spatial information of objects from images, and the control policy module uses this information to determine the optimal action. This reduces the dependence on large amounts of real-world data and also improves the generalization ability of the system.
2. **Policy training in simulation**: To further reduce data requirements, the control policy is trained in a simulation environment instead of directly on the real robot. Simulation can generate large amounts of training data quickly, free of the physical constraints of a real robot. The trained policy can be transferred directly to the real robot without additional fine-tuning.
3. **Enhancing generalization ability**: Because the policy receives explicit descriptions of objects (such as category and pose) instead of learning low-dimensional implicit representations directly from raw images, it can generalize better to new objects and environments. This helps overcome the poor generalization that opaque implicit representations cause in existing methods.

With these measures, the paper demonstrates the robustness and generalization ability of its system on a variety of challenging tasks, including semantic grasping, multi-target grasping, dense target grasping, and moving target grasping. The experimental results show that the method can learn efficiently and perform well with less real-world interaction data.
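The decoupling described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ObjectDescription` fields, the `build_observation` helper, the observation layout, and `toy_policy` are all hypothetical names chosen for illustration. The sketch only shows the interface idea: a perception module emits explicit object descriptions, and the policy consumes a small state vector derived from them rather than raw pixels.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ObjectDescription:
    """Hypothetical explicit description produced by a perception module."""
    category: str   # semantic label, e.g. from an instance-segmentation model
    x: float        # object position in the workspace plane (metres)
    y: float
    theta: float    # in-plane orientation (radians)


def build_observation(
    objects: Sequence[ObjectDescription],
    target_category: str,
    gripper_xy: Tuple[float, float],
) -> Optional[List[float]]:
    """Build a low-dimensional policy input for the nearest object of the target category.

    Returns None when no object of the requested category is visible.
    The layout [dx, dy, cos(theta), sin(theta)] is an assumption of this sketch.
    """
    candidates = [o for o in objects if o.category == target_category]
    if not candidates:
        return None
    gx, gy = gripper_xy
    nearest = min(candidates, key=lambda o: math.hypot(o.x - gx, o.y - gy))
    return [nearest.x - gx, nearest.y - gy,
            math.cos(nearest.theta), math.sin(nearest.theta)]


def toy_policy(obs: Sequence[float], gain: float = 0.5) -> List[float]:
    """Stand-in for a trained policy: step toward the target and align with it.

    A learned policy (for example one trained with PPO in simulation) would
    replace this function; the observation interface would stay the same.
    """
    dx, dy, cos_t, sin_t = obs
    return [gain * dx, gain * dy, math.atan2(sin_t, cos_t)]


# Example: two detected objects, and a request to grasp the "pen".
detections = [
    ObjectDescription("cup", 0.4, 0.1, 0.0),
    ObjectDescription("pen", 0.2, -0.3, math.pi / 2),
]
obs = build_observation(detections, "pen", gripper_xy=(0.0, 0.0))
action = toy_policy(obs)
```

Because the policy only sees the observation vector, the same policy function could in principle be driven by simulated object states during training and by perception output on the real robot, which is the transfer property the summary describes.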

Towards Generalization and Data Efficient Learning of Deep Robotic Grasping
