Abstract: Policy search is formulated in a Bayesian framework in which the return serves as the likelihood function. A risk-sensitive multiplicative return is used as the reward structure. The likelihood function is combined with an uninformative prior to form a posterior. Markov chain Monte Carlo (MCMC) is used to explore the high-reward regions of the policy space without the need for any gradient calculations. A prior distribution over the policy parameters is included as domain knowledge, which speeds up learning and improves the exploration of the MCMC algorithm by keeping the agent away from unstable regions of the state space. A parameterized control policy is learned by an MCMC-based reinforcement learning (RL) algorithm. The proposed MCMC-based RL method is a gradient-free and model-free algorithm capable of incorporating the uncertainties involved. The proposed MCMC-based RL algorithm is validated through real-world experiments on a physical 2-degree-of-freedom robotic manipulator.

Reinforcement learning methods are being applied to control problems in the robotics domain. These algorithms are well suited to the large, continuous state spaces found in robotics. Although policy search methods based on stochastic gradient optimization have become successful candidates for challenging robotics and control problems in recent years, they may become unstable when abrupt variations occur in the gradient computations. Moreover, they may converge to a locally optimal solution. To avoid these disadvantages, a Markov chain Monte Carlo (MCMC) algorithm for policy learning in the RL setting is proposed. The policy space is explored in a non-contiguous manner such that higher-reward regions have a higher probability of being visited. The proposed algorithm is applied in a risk-sensitive setting where the reward structure is multiplicative.
Our method has the advantages of being model-free and gradient-free, as well as being suitable for real-world implementation. The merits of the proposed algorithm are shown with experimental evaluations on a 2-degree-of-freedom robot arm. The experiments demonstrate that it can perform a thorough policy space search while maintaining adequate control performance, and that it can learn a complex trajectory control task within a small number of iterations.
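
The idea described in the abstract, treating a multiplicative return as a likelihood, combining it with a prior, and sampling the resulting posterior over policy parameters without gradients, can be illustrated with a minimal sketch. This is not the authors' implementation: the random-walk Metropolis-Hastings sampler, the one-dimensional toy system, the per-step reward exp(-cost), the Gaussian prior, and every name below (log_return, log_prior, mh_policy_search) are assumptions chosen only for illustration.

```python
import math
import random


def log_return(theta, horizon=20):
    """Log of a multiplicative return on a hypothetical 1-D toy system."""
    x = 1.0
    total = 0.0
    for _ in range(horizon):
        u = -theta * x                  # linear state-feedback policy (assumed form)
        x = x + 0.1 * u                 # toy dynamics; the sampler only sees rollouts
        total -= x * x + 0.01 * u * u   # log of the step reward exp(-cost)
    return total                        # log of the product of the step rewards


def log_prior(theta, scale=10.0):
    """Broad zero-mean Gaussian prior over the policy parameter (assumed)."""
    return -0.5 * (theta / scale) ** 2


def mh_policy_search(n_iter=2000, step=1.0, theta0=0.0, seed=0):
    """Random-walk Metropolis-Hastings over a scalar policy parameter."""
    rng = random.Random(seed)
    theta = theta0
    log_post = log_return(theta) + log_prior(theta)
    samples = []
    for _ in range(n_iter):
        proposal = theta + step * rng.gauss(0.0, 1.0)
        log_post_new = log_return(proposal) + log_prior(proposal)
        # Accept with probability min(1, posterior ratio); no gradients are used.
        accept_prob = math.exp(min(0.0, log_post_new - log_post))
        if rng.random() < accept_prob:
            theta, log_post = proposal, log_post_new
        samples.append(theta)
    return samples
```

Calling mh_policy_search() returns the chain of visited parameters. Because proposals are accepted according to the posterior ratio, parameters with a higher return are visited more often, and parameters that destabilize the toy system (here, theta below 0 or above 20) receive a very low return and are almost always rejected. The toy rollout is deterministic to keep the sketch short; a real robot rollout would be noisy.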

Local Policy Optimization for Trajectory-Centric Reinforcement Learning

SOMTP: A Self-Supervised Learning-Based Optimizer for MPC-Based Safe Trajectory Planning Problems in Robotics

Adaptive Model Prediction Control-Based Multi-Terrain Trajectory Tracking Framework for Mobile Spherical Robots

The Power of Learned Locally Linear Models for Nonlinear Policy Optimization

Trajectory Optimization for Unknown Constrained Systems using Reinforcement Learning

Reparameterized Policy Learning for Multimodal Trajectory Optimization

Guided Policy Search using Sequential Convex Programming for Initialization of Trajectory Optimization Algorithms

Gradient-Based Trajectory Optimization With Learned Dynamics

Adversarially Regularized Policy Learning Guided by Trajectory Optimization

Optimization-based Trajectory Tracking Approach for Multi-rotor Aerial Vehicles in Unknown Environments

Continuous-Time Trajectory Optimization for Decentralized Multi-Robot Navigation

Trajectory-Oriented Policy Optimization with Sparse Rewards

Lyapunov-based Safe Policy Optimization for Continuous Control

A real-world application of Markov chain Monte Carlo method for Bayesian trajectory control of a robotic manipulator

Model-based Policy Optimization using Symbolic World Model

Trajectory optimization for a class of robots belonging to Constrained Collaborative Mobile Agents (CCMA) family

Bidirectional Model-based Policy Optimization

Reinforcement Learning in a Safety-Embedded MDP with Trajectory Optimization

Learning to Constrain Policy Optimization with Virtual Trust Region

Enabling Efficient, Reliable Real-World Reinforcement Learning with Approximate Physics-Based Models

Bridging the gap between Learning-to-plan, Motion Primitives and Safe Reinforcement Learning