Abstract:

Purpose
Current reinforcement learning (RL) algorithms suffer from low learning efficiency and poor generalization, which significantly limit their practical application on real robots. This paper adopts a hybrid model-based and model-free policy search method with multi-timescale value function tuning, so that robots can learn complex motion planning skills in multi-goal, multi-constraint environments with only a few interactions.

Design/methodology/approach
A goal-conditioned model-based and model-free policy search method with multi-timescale value function tuning is proposed in this paper. First, the authors construct a multi-goal, multi-constraint policy optimization approach that fuses model-based policy optimization with goal-conditioned, model-free learning. Soft constraints on states and controls are applied to ensure fast and stable policy iteration. Second, an uncertainty-aware multi-timescale value function learning method is proposed, which constructs a multi-timescale value function network and adaptively chooses the value function planning timescales according to the value prediction uncertainty. This implicitly reduces the complexity of the value representation and improves the generalization performance of the policy.

Findings
The algorithm enables physical robots to learn generalized skills in real-world environments through a handful of trials. The simulation and experimental results show that the algorithm outperforms other relevant model-based and model-free RL algorithms.

Originality/value
This paper combines goal-conditioned RL and the model predictive path integral method into a unified model-based policy search framework, which improves the learning efficiency and policy optimality of motor skill learning in multi-goal, multi-constraint environments. An uncertainty-aware multi-timescale value function learning and selection method is proposed to overcome long-horizon problems, improve optimal policy resolution and thereby enhance the generalization ability of goal-conditioned RL.
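The abstract does not say how value prediction uncertainty is estimated or how a timescale is chosen from it. The following is a minimal sketch of one plausible selection rule, not the authors' method. It assumes that each timescale (represented here by a discount factor) has a small ensemble of value predictions, that ensemble disagreement (standard deviation) stands in for uncertainty, and that the longest timescale whose disagreement stays below a threshold is preferred. The function name `select_timescale` and the parameters `value_preds`, `gammas` and `max_std` are hypothetical.

```python
from statistics import fmean, pstdev


def select_timescale(value_preds, gammas, max_std):
    """Pick a planning timescale from ensemble value predictions.

    value_preds: one list per timescale, each holding the ensemble's value
                 predictions for the same state-goal pair (assumed layout).
    gammas:      discount factor for each timescale; larger means longer horizon.
    max_std:     largest ensemble disagreement still treated as "confident".

    Returns (index, mean value, standard deviation) of the chosen timescale.
    """
    # Ensemble mean and disagreement for every timescale.
    means = [fmean(preds) for preds in value_preds]
    stds = [pstdev(preds) for preds in value_preds]

    # Timescales whose predictions the ensemble agrees on.
    confident = [i for i, s in enumerate(stds) if s <= max_std]

    if confident:
        # Assumption: among confident timescales, prefer the longest horizon.
        idx = max(confident, key=lambda i: gammas[i])
    else:
        # Fallback: no timescale is confident, so take the least uncertain one.
        idx = min(range(len(stds)), key=lambda i: stds[i])

    return idx, means[idx], stds[idx]


# Illustrative numbers only: three timescales, three ensemble members each.
preds = [[1.0, 1.1, 0.9], [2.0, 2.2, 1.8], [5.0, 9.0, 1.0]]
gammas = [0.90, 0.95, 0.99]
chosen, value, spread = select_timescale(preds, gammas, max_std=0.2)
```

In this example the longest timescale has large ensemble disagreement, so the rule falls back to the middle one. Other uncertainty estimates or selection rules would fit the abstract's description equally well.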

Multi-Timescale Ensemble Q-learning for Markov Decision Process Policy Optimization

Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy

Model-Ensemble Trust-Region Policy Optimization

A Robust Policy Bootstrapping Algorithm for Multi-objective Reinforcement Learning in Non-stationary Environments

Scalable spectral representations for multi-agent reinforcement learning in network MDPs

A Two-Stage Multi-Objective Deep Reinforcement Learning Framework

Intrinsically Motivated Hierarchical Policy Learning in Multi-objective Markov Decision Processes

Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm

Non-Stationary Policy Learning for Multi-Timescale Multi-Agent Reinforcement Learning

Scalable Model-based Policy Optimization for Decentralized Networked Systems

Model-Based Reinforcement Learning via Meta-Policy Optimization

Double Meta-Learning for Data Efficient Policy Optimization in Non-Stationary Environments

Opportunistic Learning for Markov Decision Systems with Application to Smart Robots

Learning to Switch Among Agents in a Team via 2-Layer Markov Decision Processes

Multiagent Meta-Reinforcement Learning for Adaptive Multipath Routing Optimization

A goal-conditioned policy search method with multi-timescale value function tuning

A Structure-aware Online Learning Algorithm for Markov Decision Processes

Oracle-Efficient Reinforcement Learning for Max Value Ensembles

Decision Making in Non-Stationary Environments with Policy-Augmented Monte Carlo Tree Search

Mixed Reinforcement Learning for Efficient Policy Optimization in Stochastic Environments