Abstract: For a robotic grasping task in which diverse unseen target objects exist in a cluttered environment, some deep learning-based methods have achieved state-of-the-art results using visual input directly. In contrast, actor-critic deep reinforcement learning (RL) methods typically perform very poorly when grasping diverse objects, especially when learning from raw images and sparse rewards. To make these RL techniques feasible for vision-based grasping tasks, we employ state representation learning (SRL), where we encode essential information first for subsequent use in RL. However, typical representation learning procedures are unsuitable for extracting pertinent information for learning the grasping skill, because the visual inputs for representation learning, where a robot attempts to grasp a target object in clutter, are extremely complex. We found that preprocessing based on the disentanglement of a raw input image is the key to effectively capturing a compact representation. This enables deep RL to learn robotic grasping skills from highly varied and diverse visual inputs. We demonstrate the effectiveness of this approach with varying levels of disentanglement in a realistic simulated environment.

What problem does this paper attempt to address?

The problem this paper attempts to solve is the performance bottleneck encountered when vision-based deep reinforcement learning (DRL) is used to make a robot grasp diverse, unseen target objects in a cluttered environment. Specifically, conventional actor-critic DRL methods perform very poorly when learning grasping skills directly from raw images and sparse rewards. The authors therefore propose improving these DRL methods through state representation learning (SRL), in particular by disentangling the raw input image.

### Core problems of the paper

1. **Limitations of existing methods**:
   - In a cluttered environment, vision-based actor-critic DRL methods struggle to learn to grasp diverse and unseen objects.
   - These methods perform very poorly when learning grasping skills directly from raw images and sparse rewards.
2. **Proposed solutions**:
   - Extract the information relevant to the grasping task through SRL, in particular by disentangling the raw input image.
   - Disentangled images make it easier to capture a compact state representation, which in turn improves DRL performance.

### Specific measures

- **Disentanglement levels**:
  - **L1: Visual attention**: Keep only the target object and its surrounding area, and remove the irrelevant background.
  - **L2: Separation of internal and external information**: Separate the internal information, which includes the robotic arm, from the external information, which includes the target object.
  - **L3: Separation of what and where streams**: Further decompose both the internal and external images into position information and appearance information.
- **Experimental verification**:
  - In a simulated environment, train the SRL model on images disentangled at each level and combine it with actor-critic DRL algorithms (such as DDPG and D4PG) to learn the grasping task.
  - Evaluate the effectiveness of the SRL algorithm indirectly, through the grasping success rate.

### Formula presentation

- **Policy gradient formula in DDPG**:
  \[ J(\theta) = \mathbb{E}_{s,a}\left[ Q^{\pi_\theta}(s,a) \mid a = \pi_\theta(s) \right] \]
  \[ \nabla_\theta J(\theta) \approx \mathbb{E}_{\rho}\left[ \nabla_\theta \pi_\theta(s) \, \nabla_a Q^{\pi_\theta}(s,a) \big|_{a = \pi_\theta(s)} \right] \]
- **VAE loss function**:
  \[ L_{VAE}(\theta,\phi) = KL\left( q_\theta(z \mid x) \,\|\, p(z) \right) - \mathbb{E}_{q_\theta(z \mid x)}\left[ \log p_\phi(x \mid z) \right] \]

Through these measures, the paper aims to improve the grasping ability of deep reinforcement learning in complex visual environments, especially when dealing with diverse and unseen objects.
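The VAE loss above can be made concrete with a small sketch. It is not the paper's implementation. It assumes a diagonal Gaussian encoder q_θ(z|x), a standard normal prior p(z), a Bernoulli decoder p_φ(x|z) over pixel values in [0, 1], and a single-sample Monte Carlo estimate of the expectation. All function names, including the stand-in `toy_decoder`, are hypothetical and chosen only for illustration.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, x_prob, eps=1e-7):
    """log p(x|z) for pixels in [0, 1] under a Bernoulli decoder (assumed)."""
    total = 0.0
    for xi, pi in zip(x, x_prob):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def toy_decoder(z, n_pixels=3):
    """Hypothetical stand-in for p_phi(x|z): maps z to pixel probabilities."""
    s = sum(z)
    return [1.0 / (1.0 + math.exp(-(s + 0.5 * i))) for i in range(n_pixels)]


def vae_loss(x, mu, log_var, decode, rng):
    """KL term minus a one-sample estimate of E_q[log p(x|z)]."""
    z = reparameterize(mu, log_var, rng)
    return kl_to_standard_normal(mu, log_var) - bernoulli_log_likelihood(x, decode(z))


if __name__ == "__main__":
    rng = random.Random(0)
    print(vae_loss([1.0, 0.0, 1.0], [0.1, -0.2], [0.0, 0.0], toy_decoder, rng))
```

In practice the encoder outputs `mu` and `log_var` and the decoder is a neural network; here both are replaced by fixed values and a toy function so that the two terms of the loss can be inspected in isolation.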

Acceleration of Actor-Critic Deep Reinforcement Learning for Visual Grasping in Clutter by State Representation Learning Based on Disentanglement of a Raw Input Image

InterRep: A Visual Interaction Representation for Robotic Grasping

Ensemble Bootstrapped Deep Deterministic Policy Gradient For Vision-Based Robotic Grasping

DexRepNet: Learning Dexterous Robotic Grasping Network with Geometric and Spatial Hand-Object Representations

Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach

Towards Generalization and Data Efficient Learning of Deep Robotic Grasping

Learning to Regrasp Using Visual–Tactile Representation-Based Reinforcement Learning

Deep Reinforcement Learning-Based Robotic Grasping in Clutter and Occlusion

Reinforcement Learning with Decoupled State Representation for Robot Manipulations

Weakly Supervised Disentangled Representation for Goal-conditioned Reinforcement Learning

A Deep Learning Approach to Grasping the Invisible

Hierarchical Policies for Cluttered-Scene Grasping with Latent Plans

Learn to grasp unknown objects in robotic manipulation

DEAR: Disentangled Environment and Agent Representations for Reinforcement Learning without Reconstruction

Task-Induced Representation Learning

Learning Visual Robotic Control Efficiently with Contrastive Pre-training and Data Augmentation

Dexterous Manipulation from Images: Autonomous Real-World RL via Substep Guidance

Visual Reinforcement Learning with Self-Supervised 3D Representations

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

Stabilizing Visual Reinforcement Learning Via Asymmetric Interactive Cooperation