Abstract: To address the problem that the approximator in current approximate policy iteration reinforcement learning cannot be constructed fully automatically, a reinforcement learning algorithm named Nonparametric Approximation Policy Iteration based on Dyna Framework (NPAPI-Dyna) was proposed. A sampling cache and a sampling change rate were introduced to design a two-stage random sampling process for collecting samples. Core state basis functions were generated through a trial-and-error process using profile tolerance and K-means clustering. The Q-value function approximator was generated with complete coverage of the samples as the target. A greedy strategy was applied to design the action selector. The access frequency of the state basis functions was used to describe the topological features of the environment and to construct an environment estimation model. The learning and planning processes were combined through the identification process of the Dyna framework to accelerate learning. In simulation experiments on single inverted pendulum balance control, when the reinforcement learning error rate is 0.01, the learning success rate of the algorithm reaches 100%, the minimum number of attempts needed to succeed is only 2, the average number of attempts is only 7.73, the mean absolute deviation of the angle is 3.0538°, and the average oscillation range of the angle is 2.759°. When the reinforcement learning error rate is 0.1, over 100 independent simulation runs, Online-LSPI and BLSPI (Batch Least-Squares Policy Iteration) need more than 150 attempts on average to learn the control strategy, whereas NPAPI-Dyna succeeds within 50 attempts. The experimental results show that NPAPI-Dyna can construct and adjust its learning structure fully automatically, with high learning accuracy and fast convergence.
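
The Dyna idea referred to in the abstract, interleaving learning from real experience with planning on a learned model, can be illustrated with a minimal tabular Dyna-Q sketch. This is not NPAPI-Dyna: it uses a plain Q-table instead of the nonparametric basis-function approximator, a deterministic lookup-table model instead of the access-frequency model, and an epsilon-greedy selector. The function names, the toy chain environment, and all parameter values are assumptions made only for illustration.

```python
import random
from collections import defaultdict

def dyna_q(step, start, actions, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, max_steps=200, seed=0):
    """Tabular Dyna-Q sketch; step(state, action) -> (next_state, reward, done)."""
    rng = random.Random(seed)
    q = defaultdict(float)  # Q-values keyed by (state, action)
    model = {}              # learned model: (state, action) -> (reward, next_state, done)

    def greedy(s):
        # Pick a maximizing action, breaking ties at random.
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        # One-step Q-learning backup.
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            s2, r, done = step(s, a)
            update(s, a, r, s2, done)        # learning: real experience
            model[(s, a)] = (r, s2, done)    # model update
            for _ in range(planning_steps):  # planning: replay from the model
                ps, pa = rng.choice(list(model))
                update(ps, pa, *model[(ps, pa)])
            s = s2
            if done:
                break
    return q, greedy

def chain_step(state, action):
    # Hypothetical toy task: states 0..4 on a line, reward 1 on reaching state 4.
    nxt = min(max(state + action, 0), 4)
    return nxt, (1.0 if nxt == 4 else 0.0), nxt == 4

q, greedy = dyna_q(chain_step, start=0, actions=[-1, 1])
print([greedy(s) for s in range(4)])
```

Each real transition triggers several simulated updates drawn from the stored model, which is how Dyna-style methods reduce the number of real attempts needed.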

A Heuristic Dyna Optimizing Algorithm Using Approximate Model Representation

Gradient Q : A Unified Algorithm with Function Approximation for Reinforcement Learning

Efficient Reinforcement Learning in Continuous State and Action Spaces with Dyna and Policy Approximation.

Dyna-H: a heuristic planning reinforcement learning algorithm applied to role-playing-game strategy decision systems

Nonparametric approximation policy iteration reinforcement learning based on Dyna framework

Model-free Adaptive Dynamic Programming for Optimal Control of Discrete-time Affine Nonlinear System

Improved Dyna-Q: A Reinforcement Learning Method Focused via Heuristic Graph for AGV Path Planning in Dynamic Environments

Goal-Conditioned Hierarchical Reinforcement Learning with High-Level Model Approximation.

Efficient approximate dynamic programming based on design and analysis of computer experiments for infinite-horizon optimization

A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Approximate Policy Iteration With Deep Minimax Average Bellman Error Minimization

Dyna-style Model-based reinforcement learning with Model-Free Policy Optimization

Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation

A Hybrid PAC Reinforcement Learning Algorithm

Trajectory Sampling Value Iteration: Improved Dyna Search for MDPs

Multiple Suboptimal Policies Integrated Reinforcement Learning Algorithm for Path Planning

Hybrid Heuristic Online Planning for POMDPs

An immediate-return reinforcement learning for the atypical Markov decision processes

A Deep Reinforcement Learning Approach to Efficient Distributed Optimization

An Improved Dyna-Q Algorithm Inspired by the Forward Prediction Mechanism in the Rat Brain for Mobile Robot Path Planning

Umbilical cord serum activin A levels are increased in pre-eclampsia with impaired blood flow in the uteroplacental and fetal circulation.