Acceleration of Actor-Critic Deep Reinforcement Learning for Visual Grasping in Clutter by State Representation Learning Based on Disentanglement of a Raw Input Image

Taewon Kim, Yeseong Park, Youngbin Park, Il Hong Suh
DOI: https://doi.org/10.48550/arXiv.2002.11903
2020-02-27
Abstract: For a robotic grasping task in which diverse unseen target objects exist in a cluttered environment, some deep learning-based methods have achieved state-of-the-art results using visual input directly. In contrast, actor-critic deep reinforcement learning (RL) methods typically perform very poorly when grasping diverse objects, especially when learning from raw images and sparse rewards. To make these RL techniques feasible for vision-based grasping tasks, we employ state representation learning (SRL), where we encode essential information first for subsequent use in RL. However, typical representation learning procedures are unsuitable for extracting pertinent information for learning the grasping skill, because the visual inputs for representation learning, where a robot attempts to grasp a target object in clutter, are extremely complex. We found that preprocessing based on the disentanglement of a raw input image is the key to effectively capturing a compact representation. This enables deep RL to learn robotic grasping skills from highly varied and diverse visual inputs. We demonstrate the effectiveness of this approach with varying levels of disentanglement in a realistic simulated environment.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the performance bottleneck that vision-based deep reinforcement learning (DRL) encounters when a robot must grasp diverse, unseen target objects in a cluttered environment. Specifically, conventional actor-critic DRL methods perform very poorly when learning grasping skills directly from raw images and sparse rewards. The authors therefore propose improving these methods through state representation learning (SRL), with preprocessing that disentangles the raw input image.

### Core problems of the paper

1. **Limitations of existing methods**:
   - In a cluttered environment, vision-based actor-critic DRL methods struggle to learn to grasp diverse and unseen objects.
   - These methods perform very poorly when learning grasping skills directly from raw images and sparse rewards.
2. **Proposed solutions**:
   - Extract the information relevant to the grasping task through state representation learning (SRL), in particular by disentangling the raw input image.
   - The disentangled images allow a compact state representation to be captured more effectively, which in turn improves deep reinforcement learning.

### Specific measures

- **Disentanglement levels**:
  - **L1: Visual attention**: Keep only the target object and its surrounding area, and remove the irrelevant background.
  - **L2: Separation of internal and external information**: Separate the internal information, which includes the robotic arm, from the external information, which includes the target object.
  - **L3: Separation of what and where streams**: Further decompose both the internal and the external images into position information and appearance information.
- **Experimental verification**:
  - In a simulated environment, train the SRL model on images disentangled at each level and combine it with actor-critic DRL algorithms (such as DDPG and D4PG) to learn the grasping task.
  - Evaluate the effectiveness of the SRL algorithm indirectly through the grasping success rate.

### Formula presentation

- **Policy gradient formula in DDPG**:
  \[ J(\theta) = \mathbb{E}_{s,a}\left[ Q^{\pi_\theta}(s,a) \mid a = \pi_\theta(s) \right] \]
  \[ \nabla_\theta J(\theta) \approx \mathbb{E}_{\rho}\left[ \nabla_\theta \pi_\theta(s)\, \nabla_a Q^{\pi_\theta}(s,a) \mid a = \pi_\theta(s) \right] \]
- **VAE loss function**:
  \[ L_{VAE}(\theta,\phi) = \mathrm{KL}\left( q_\theta(z \mid x) \,\|\, p(z) \right) - \mathbb{E}_{q_\theta(z \mid x)}\left[ \log p_\phi(x \mid z) \right] \]

Through these measures, the paper aims to improve the grasping ability of deep reinforcement learning in complex visual environments, especially when dealing with diverse and unseen objects.
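
To make the DDPG policy gradient formula above concrete, here is a minimal pure-Python sketch of the chain rule it describes: the gradient of the policy with respect to its parameter, multiplied by the gradient of the critic with respect to the action and averaged over a batch of states. Everything in it is a toy assumption for illustration and is not taken from the paper: the scalar linear policy `policy(theta, s) = theta * s`, the hand-written critic `Q(s, a) = -(a - 2s)^2`, and all function names are hypothetical. In the paper both the actor and the critic would be neural networks operating on learned state representations.

```python
def policy(theta, s):
    """Toy deterministic policy: a = pi_theta(s) = theta * s (assumed form)."""
    return theta * s


def dpi_dtheta(theta, s):
    """Gradient of the toy policy with respect to theta."""
    return s


def dq_da(s, a):
    """Gradient w.r.t. the action of the toy critic Q(s, a) = -(a - 2s)^2."""
    return -2.0 * (a - 2.0 * s)


def policy_gradient(theta, states):
    """Batch estimate of grad_theta J = E[ grad_theta pi(s) * grad_a Q(s, a) ]
    with the critic gradient evaluated at a = pi_theta(s)."""
    total = 0.0
    for s in states:
        a = policy(theta, s)
        total += dpi_dtheta(theta, s) * dq_da(s, a)
    return total / len(states)


def train(theta, states, lr=0.1, steps=200):
    """Plain gradient ascent on J using the estimate above."""
    for _ in range(steps):
        theta += lr * policy_gradient(theta, states)
    return theta
```

Because the toy critic is maximized at `a = 2s`, gradient ascent from any starting `theta` should move the policy parameter toward 2, which gives a quick sanity check that the sign and the chain rule are right.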
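
The VAE loss above can likewise be sketched in a few lines. The version below rests on assumptions the summary does not state: a diagonal Gaussian encoder `q(z|x)` parameterized by `mu` and `log_var`, a standard normal prior `p(z)` (which gives the KL term a closed form), and a Bernoulli decoder whose output probabilities `x_prob` come from a single sample of `z`. The function names are hypothetical, and the likelihood used in the paper may differ.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def bernoulli_log_likelihood(x, x_prob, eps=1e-12):
    """log p(x|z) for pixel values x in [0, 1], given decoder probabilities."""
    total = 0.0
    for xi, p in zip(x, x_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += xi * math.log(p) + (1.0 - xi) * math.log(1.0 - p)
    return total


def vae_loss(x, x_prob, mu, log_var):
    """KL term minus a one-sample estimate of E_q[ log p(x|z) ]."""
    return kl_to_standard_normal(mu, log_var) - bernoulli_log_likelihood(x, x_prob)
```

When the encoder output equals the prior (`mu = 0`, `log_var = 0`) the KL term vanishes and the loss reduces to the reconstruction negative log-likelihood alone.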