Abstract: Reinforcement learning (RL) is an effective method for controlling dynamic systems without prior knowledge. One of the most important and difficult problems in RL is improving data efficiency. Probabilistic inference for learning control (PILCO) is a state-of-the-art data-efficient framework that uses a Gaussian process to model system dynamics. However, it focuses only on optimizing cumulative rewards and does not consider the accuracy of the dynamic model, which is an important factor for controller learning. To further improve the data efficiency of PILCO, we propose an active exploration version (AEPILCO) that uses information entropy to characterize samples. In the policy evaluation stage, we incorporate an information entropy criterion into long-term sample prediction. Using this informative policy evaluation function, the algorithm obtains informative policy parameters in the policy improvement stage. Executing the policy with these parameters on the actual system produces an informative sample set, which helps in learning an accurate dynamic model. Thus, AEPILCO improves data efficiency by actively selecting informative samples, based on the information entropy criterion, to learn an accurate dynamic model. We demonstrate the validity and efficiency of the proposed algorithm on several challenging control problems: a cart pole, a pendubot, a double pendulum, and a cart double pendulum. AEPILCO learns a controller in fewer trials than PILCO, as verified through theoretical analysis and experimental results.
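
The abstract does not give the exact form of the informative policy evaluation function, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes that the long-term predictions are Gaussian state distributions (as in PILCO), that policy evaluation is written as cost minimization, and that the entropy criterion enters as a weighted bonus. The function names and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def gaussian_entropy(cov):
    """Differential entropy (in nats) of a Gaussian with covariance `cov`."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    dim = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    # H = 0.5 * (d * log(2 * pi * e) + log det(cov))
    return 0.5 * (dim * np.log(2.0 * np.pi * np.e) + logdet)


def informative_policy_cost(expected_costs, predicted_covs, weight=0.1):
    """Hypothetical entropy-augmented objective (an assumption, not the paper's).

    expected_costs: per-step expected costs along the predicted trajectory.
    predicted_covs: per-step covariances of the predicted state distributions.
    weight: assumed trade-off between cost and information gain.
    """
    total_cost = float(np.sum(expected_costs))
    total_entropy = sum(gaussian_entropy(c) for c in predicted_covs)
    # Lower is better: predicted states with higher entropy reduce the
    # objective, so a minimizer is drawn toward less well-known regions.
    return total_cost - weight * float(total_entropy)
```

With `weight=0.0` the objective reduces to the plain cumulative expected cost. A positive weight favors policies whose predicted states are more uncertain under the model, which is one simple way to encourage informative samples.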

Generalized Policy Improvement Algorithms with Theoretically Supported Sample Reuse

Generalized Proximal Policy Optimization with Sample Reuse

Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

Generalizable Policy Improvement via Reinforcement Sampling (Student Abstract)

Off-Policy RL Algorithms Can be Sample-Efficient for Continuous Control via Sample Multiple Reuse

Active Policy Improvement from Multiple Black-box Oracles

Generalised Policy Improvement with Geometric Policy Composition

Learning with Training Wheels: Speeding up Training with a Simple Controller for Deep Reinforcement Learning

A General Control-Theoretic Approach for Reinforcement Learning: Theory and Algorithms

Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning

Safe Deep Policy Adaptation

Sample-Efficient Reinforcement Learning Based on Dynamics Models via Meta-policy Optimization

Model-based Policy Optimization using Symbolic World Model

An Active Exploration Method for Data Efficient Reinforcement Learning

Blending Imitation and Reinforcement Learning for Robust Policy Improvement

Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation

Self-Improvement for Neural Combinatorial Optimization: Sample without Replacement, but Improvement

Policy Optimization over General State and Action Spaces

Sample Efficient Deep Reinforcement Learning with Online State Abstraction and Causal Transformer Model Prediction

Efficient sample reuse in policy gradients with parameter-based exploration

Dynamic Policy Programming with Descending Regularization for Efficient Reinforcement Learning Control