Abstract: Classic reinforcement learning algorithms mainly target discrete state and action spaces. In complex environments, or in the more widely applicable continuous spaces, methods designed for discrete spaces are inadequate. One feasible approach is to discretize the state and action spaces so that discrete-space methods can be applied to continuous problems; however, finding a reasonable discretization is not easy. Methods designed for continuous spaces need no discretization, but most of them do not consider the constraint on the action range, and the optimal action they produce fluctuates heavily. To be more applicable in continuous action spaces, we propose AW-PS-AC, an actor-critic algorithm for continuous action spaces that weights actions in order to respect the action range and reduce this fluctuation. AW-PS-AC is built on the actor-critic framework, a classic approach for continuous spaces. The exploration policy is a Gaussian distribution whose mean is the current optimal action, so the selected action is the optimal action perturbed by a small exploration term. The optimal state value function and the optimal policy are approximated by linear function approximation, and gradient descent is used to update one set of value function parameters and two sets of policy parameters. The two sets of policy parameters are weighted to obtain the optimal policy, which constrains the optimal action so that it does not exceed the action range and the policy does not fluctuate significantly. Moreover, the samples are used more fully, which yields better performance from only a small amount of data. To speed up convergence, an improved temporal difference algorithm is designed in which the temporal difference error (TD error) of the value function is used to update the optimal policy, and a policy eligibility trace is introduced. Under three given assumptions, the convergence of AW-PS-AC is analyzed theoretically and proved. On two classic reinforcement learning benchmarks with nonlinear system dynamics, the pole-balancing problem and the puddle world problem, AW-PS-AC is compared with three representative continuous-space methods: continuous actor-critic learning automaton (CACLA), continuous-action Q-learning (CAQ) and incremental natural actor-critic with scaling gradient (INAC-S). The results show that AW-PS-AC performs well in both experiments, which demonstrates that it can effectively find approximately optimal solutions in continuous spaces. AW-PS-AC outperforms the compared algorithms in both convergence and stability: in the experiments it converges after only a few episodes and remains stable after convergence.
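The ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the two policy parameter sets are weighted, so the convex combination of the range endpoints below is an assumption, as are all function and parameter names. For simplicity the eligibility trace is applied to the linear critic, whereas the abstract attaches a trace to the policy; the policy update itself is omitted.

```python
import numpy as np


def weighted_action(phi, w_lo, w_hi, a_min, a_max):
    """Hypothetical action weighting (an assumption, not the paper's rule).

    Two linear scores, one per policy parameter set, are turned into
    softmax weights over the range endpoints. The result is a convex
    combination of a_min and a_max, so it always lies in [a_min, a_max].
    """
    scores = np.array([w_lo @ phi, w_hi @ phi])
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return float(weights[0] * a_min + weights[1] * a_max)


def explore(mean_action, sigma, a_min, a_max, rng):
    """Gaussian exploration centred on the current greedy action.

    The sample is clipped so the executed action stays in the range.
    """
    return float(np.clip(rng.normal(mean_action, sigma), a_min, a_max))


def td_step(theta, trace, phi, phi_next, reward, gamma, lam, alpha):
    """One TD(lambda) update of a linear value function V(s) = theta . phi(s)."""
    delta = reward + gamma * (theta @ phi_next) - theta @ phi  # TD error
    trace = gamma * lam * trace + phi  # accumulating eligibility trace
    theta = theta + alpha * delta * trace
    return theta, trace, float(delta)


# Usage with made-up features and parameters.
rng = np.random.default_rng(0)
phi = np.array([1.0, 0.5])
greedy = weighted_action(phi, np.zeros(2), np.zeros(2), a_min=-2.0, a_max=2.0)
action = explore(greedy, sigma=0.1, a_min=-2.0, a_max=2.0, rng=rng)
theta, trace, delta = td_step(
    np.zeros(2), np.zeros(2), phi, np.array([0.0, 1.0]),
    reward=1.0, gamma=0.9, lam=0.5, alpha=0.1,
)
```

Because the greedy action is a convex combination of the endpoints, it cannot leave the action range whatever values the parameters take; this is one simple way to obtain the property the abstract describes, under the stated assumption.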

An efficient reinforcement learning algorithm for continuous actions

Efficient Reinforcement Learning in Continuous State and Action Spaces with Dyna and Policy Approximation

Continuous-action reinforcement learning with fast policy search and adaptive basis function selection

A CNN-based Policy for Optimizing Continuous Action Control by Learning State Sequences

An Optimized Dyna Architecture Algorithm with Prioritized Sweeping

Kernel-Based Continuous-Action Actor-Critic Learning

An Improved Actor-Critic Algorithm in Continuous Spaces with Action Weighting

An efficient reinforcement learning algorithm for learning deterministic policies in continuous domains

Continuous control with deep reinforcement learning

Bayesian Q learning method with Dyna architecture and prioritized sweeping

Active Exploration Deep Reinforcement Learning for Continuous Action Space with Forward Prediction

An Efficient Learning Automaton Scheme for Massive-Action Environments

A Novel Q-Learning Approach with Continuous States and Actions

Deep Multi-Agent Reinforcement Learning with Discrete-Continuous Hybrid Action Spaces

Reinforcement Learning Method For Continuous State Space Based On Dynamic Neural Network

Achieving Multiagent Coordination Through Cala-Rfmq Learning In Continuous Action Space

A Set of Novel Continuous Action-Set Reinforcement Learning Automata Models to Optimize Continuous Functions

Dyna-Validator: A Model-based Reinforcement Learning Method with Validated Simulated Experiences

Continuous deep q-learning with model-based acceleration

Asynchronous reinforcement learning algorithms for solving discrete space path planning problems

A Phased Dyna Reinforcement Learning Algorithm