Towards Generalization and Data Efficient Learning of Deep Robotic Grasping

Zhixin Chen, Mengxiang Lin, Zhixin Jia, Shibo Jian
DOI: https://doi.org/10.48550/arXiv.2007.00982
2020-07-02
Abstract: Deep reinforcement learning (DRL) has been proven to be a powerful paradigm for learning complex control policies autonomously. Numerous recent applications of DRL in robotic grasping have successfully trained DRL robotic agents end-to-end, mapping visual inputs directly into control instructions, but the amount of training data required may hinder these applications in practice. In this paper, we propose a DRL-based robotic visual grasping framework in which visual perception and control policy are trained separately rather than end-to-end. The visual perception produces physical descriptions of the objects to be grasped, and the policy makes use of them to decide optimal actions based on DRL. Benefiting from the explicit representation of objects, the policy is expected to be endowed with more generalization power over new objects and environments. In addition, the policy can be trained in simulation and transferred to a real robotic system without any further training. We evaluate our framework in a real-world robotic system on a number of robotic grasping tasks, such as semantic grasping, clustered object grasping, and moving object grasping. The results show impressive robustness and generalization of our system.
Robotics
What problem does this paper attempt to address?
The main problem this paper attempts to solve is **improving the generalization ability and data efficiency of deep reinforcement learning in robotic grasping tasks**. Specifically, the authors point out that although deep reinforcement learning (DRL) has made remarkable progress in robotic grasping, these methods usually require a large amount of training data, which is time-consuming and resource-intensive to collect in practice. In addition, most existing DRL methods generalize poorly to new objects or environments, which limits their real-world application.

To solve these problems, the paper proposes a framework that reduces the amount of data required and improves generalization by training visual perception and the control policy separately. The specific measures are:

1. **Separate training of visual perception and control policy**: Traditional end-to-end training needs a large amount of data to optimize visual feature extraction and the control policy simultaneously. In the proposed framework, the visual perception module (based on Mask R-CNN) and the control policy module (based on PPO) are trained separately. The visual perception module extracts the semantic and spatial information of objects from images, and the control policy module uses this information to determine the optimal action. This reduces the dependence on large amounts of real-world data and improves the generalization ability of the system.
2. **Policy training in a simulation environment**: To further reduce the data requirements, the control policy is trained in simulation instead of directly on the real robot. Simulation can generate a large amount of training data quickly, free of the physical constraints of real-world data collection. The trained policy can be transferred directly to the real robot without additional fine-tuning.
3. **Enhancing generalization ability**: Because the policy receives explicit descriptions of objects (such as category and pose) instead of learning low-dimensional implicit representations directly from the raw image, it can generalize better to new objects and environments. This helps to overcome the poor generalization caused by opaque implicit representations in existing methods.

With these measures, the paper demonstrates the robustness and generalization ability of its system on a variety of challenging tasks, including semantic grasping, multi-target grasping, dense target grasping, and moving target grasping. The experimental results show that the method achieves efficient learning and good performance with less real-world interaction data.
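
The separation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ObjectDescription` fields, the `build_observation` helper, and the hand-written `placeholder_policy` are all hypothetical stand-ins. In the paper, the descriptions would come from the Mask R-CNN based perception module and the action from a PPO-trained policy. The sketch only shows the interface idea, namely that the policy sees a compact, explicit state (category and pose) rather than raw pixels.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ObjectDescription:
    """Hypothetical explicit description of one detected object."""
    category: str   # semantic label, e.g. "ball"
    x: float        # planar position in metres (assumed workspace frame)
    y: float
    yaw: float      # in-plane orientation in radians


def build_observation(
    objects: Sequence[ObjectDescription],
    target_category: str,
    gripper_xy: Tuple[float, float],
) -> Optional[List[float]]:
    """Turn perception output into a low-dimensional policy input.

    Selects the nearest object of the requested category and returns its
    position relative to the gripper plus its yaw. Returns None when no
    object of that category is visible.
    """
    gx, gy = gripper_xy
    candidates = [o for o in objects if o.category == target_category]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda o: math.hypot(o.x - gx, o.y - gy))
    return [nearest.x - gx, nearest.y - gy, nearest.yaw]


def placeholder_policy(observation: Sequence[float], step: float = 0.05) -> List[float]:
    """Stand-in for a trained policy (assumption: not a learned model).

    Moves toward the target by at most `step` metres per axis and passes
    the target yaw through as the desired gripper rotation.
    """
    dx, dy, yaw = observation

    def clip(value: float) -> float:
        return max(-step, min(step, value))

    return [clip(dx), clip(dy), yaw]


if __name__ == "__main__":
    scene = [
        ObjectDescription("cup", 0.30, 0.10, 0.0),
        ObjectDescription("ball", 0.10, 0.00, 0.5),
    ]
    obs = build_observation(scene, "ball", gripper_xy=(0.0, 0.0))
    if obs is not None:
        print(placeholder_policy(obs))
```

Because the policy input depends only on the description format, the same policy could in principle be fed from a simulator's ground-truth object states during training and from a perception module at deployment, which is the property the framework relies on for transfer without fine-tuning.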