Abstract: Deep reinforcement learning (RL) algorithms frequently require prohibitive interaction experience to ensure the quality of learned policies. The limitation is partly because the agent cannot learn much from the many low-quality trials in the early learning phase, which results in a low learning rate. Focusing on addressing this limitation, this paper makes a twofold contribution. First, we develop an algorithm, called Experience Grafting (EG), to enable RL agents to reorganize segments of the few high-quality trajectories from the experience pool to generate many synthetic trajectories while retaining the quality. Second, building on EG, we further develop an AutoEG agent that automatically learns to adjust the grafting-based learning strategy. Results collected from a set of six robotic control environments show that, in comparison to a standard deep RL algorithm (DDPG), AutoEG increases the speed of the learning process by at least 30%.

What problem does this paper attempt to address?

This paper attempts to solve a key problem that Deep Reinforcement Learning (DRL) algorithms encounter during learning: a large amount of interaction experience is required to ensure the quality of the learned policy. Specifically, in the early learning stage, DRL algorithms produce a large number of low-quality trials, resulting in low learning efficiency, which in turn limits their application to complex tasks.

To solve this problem, the paper makes two main contributions:

1. **Experience Grafting (EG)**: An algorithm named "Experience Grafting" is developed, enabling reinforcement learning agents to reorganize segments of a small number of high-quality trajectories from the experience pool to generate many synthetic trajectories while maintaining the quality of these trajectories.
2. **Automated Experience Grafting (AutoEG)**: Building on EG, an agent is constructed that can automatically adjust the grafting-based learning strategy. In this way, AutoEG can dynamically optimize its grafting strategy at different learning stages, thereby improving learning efficiency.

### Main Technical Details

- **Distance Function**: Used to measure the similarity between two states, defined as:
  \[ \text{Dis}(s, s') = W(P(s), P(s')) \]
  where \(P\) is a function that normalizes the state vector into a distribution representation, and \(W(P_1, P_2)\) is the first Wasserstein distance (or Earth Mover's Distance), defined as:
  \[ W(P_1, P_2) = \inf_{\gamma \in \Pi(P_1, P_2)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|] \]
- **Error Function**: Measures the grafting error between the head segment and the tail segment, defined as:
  \[ \text{Err}(\text{Seg}_1, \text{Seg}_2) = \text{Dis}(\text{Term}(\text{Seg}_1), \text{Init}(\text{Seg}_2)) \]
  where \(\text{Term}(\text{Seg})\) returns the terminal state of the segment, and \(\text{Init}(\text{Seg})\) returns the initial state of the segment.
- **Union Function**: When the grafting error of two segments is less than the threshold \(\epsilon\), a synthetic trajectory is generated:
  \[ \text{Uni}(\text{Seg}_1, \text{Seg}_2) = \begin{cases} \text{append}(\text{Seg}_1, \text{Seg}_2) & \text{if } \text{Err}(\text{Seg}_1, \text{Seg}_2) < \epsilon \\ \emptyset & \text{otherwise} \end{cases} \]
- **Performance Quality Function**: Measures the performance quality of a trajectory by its cumulative reward:
  \[ \text{Qua}(\text{Trj}) = R_0(\text{Trj}) \]

### Experimental Results

The paper evaluates EG and AutoEG on six robot control environments (such as Walker2d and HalfCheetah). The experimental results show that in most environments EG learns faster than standard DDPG, and that AutoEG performs best in all environments, with an average learning speed improvement of at least 30%. For example, in the Walker2d environment, the AUC improvement rate of AutoEG is close to 100%, which means that it almost doubles the cumulative rewards. In addition, AutoEG significantly outperforms the other methods in terms of final policy quality.

### Summary

This paper, by introducing Expe
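
The four functions listed under Main Technical Details (Dis, Err, Uni, Qua) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the summary does not say how \(P\) normalizes a state, so the hypothetical `to_distribution` shifts the vector to be non-negative and divides by its sum; the Wasserstein distance is computed over the state's index positions as a shared one-dimensional support; a segment is represented as a list of `(state, action, reward)` steps; and \(R_0\) is taken to be the undiscounted sum of rewards.

```python
from itertools import accumulate


def to_distribution(state):
    """Hypothetical stand-in for P: shift to non-negative values, then normalize to sum 1."""
    low = min(state)
    shifted = [x - low for x in state]
    total = sum(shifted)
    if total == 0:
        # All components equal: fall back to a uniform distribution.
        return [1.0 / len(state)] * len(state)
    return [x / total for x in shifted]


def wasserstein_1d(p, q):
    """First Wasserstein distance on the shared support 0..n-1 (sum of absolute CDF differences)."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def dis(s1, s2):
    """Dis(s, s') = W(P(s), P(s'))."""
    return wasserstein_1d(to_distribution(s1), to_distribution(s2))


def err(seg1, seg2):
    """Err(Seg1, Seg2) = Dis(Term(Seg1), Init(Seg2)); steps are (state, action, reward) (assumed layout)."""
    return dis(seg1[-1][0], seg2[0][0])


def uni(seg1, seg2, eps):
    """Uni: append the segments if the grafting error is below eps, else the empty trajectory."""
    return seg1 + seg2 if err(seg1, seg2) < eps else []


def qua(trj):
    """Qua(Trj) = R_0(Trj), assumed here to be the undiscounted cumulative reward."""
    return sum(reward for _, _, reward in trj)


# Toy segments: the head ends in the same state the first tail starts in.
head = [([0.0, 1.0], 0, 1.0), ([1.0, 2.0], 0, 2.0)]
tail_close = [([1.0, 2.0], 0, 3.0), ([2.0, 0.0], 0, 4.0)]
tail_far = [([2.0, 0.0], 0, 0.0)]

grafted = uni(head, tail_close, eps=0.1)   # grafting error is 0, so the segments are joined
rejected = uni(head, tail_far, eps=0.1)    # grafting error exceeds eps, so nothing is produced
print(len(grafted), qua(grafted), rejected)
```

In this toy case the first pair is grafted into a four-step synthetic trajectory whose quality is the sum of the rewards of both segments, while the second pair is rejected because the terminal state of the head and the initial state of the tail are too far apart under the assumed distance.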

AutoEG: Automated Experience Grafting for Off-Policy Deep Reinforcement Learning

Generalize Robot Learning from Demonstration to Variant Scenarios with Evolutionary Policy Gradient

Learning with Training Wheels: Speeding up Training with a Simple Controller for Deep Reinforcement Learning

Synthetic Experience Replay

Experience Augmentation: Boosting and Accelerating Off-Policy Multi-Agent Reinforcement Learning

Evolution-Guided Policy Gradient in Reinforcement Learning

Efficient Reinforcement-Learning Control Algorithm Using Experience Reuse

Deep Reinforcement Learning using Genetic Algorithm for Parameter Optimization

ACDER: Augmented Curiosity-Driven Experience Replay

Efficient Diversity-based Experience Replay for Deep Reinforcement Learning

PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

Automatic Data Augmentation for Generalization in Deep Reinforcement Learning

Policy ensemble gradient for continuous control problems in deep reinforcement learning

Safe Driving Via Expert Guided Policy Optimization

Replay across Experiments: A Natural Extension of Off-Policy RL

Learning to drive via Apprenticeship Learning and Deep Reinforcement Learning

Jointly Pre-training with Supervised, Autoencoder, and Value Losses for Deep Reinforcement Learning

Automatic Data Augmentation for Generalization in Reinforcement Learning

Robustness and Performance of Deep Reinforcement Learning.

Episodic Reinforcement Learning with Expanded State-reward Space

Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning