Abstract: Value function approximation methods such as Q-learning are used far more widely in discrete control than in continuous control, because the optimal action is easier to select in the discrete setting. In continuous control, selecting the action is a non-convex optimization problem with respect to a complex value function. Some notable studies simplify this problem by assuming that the value function is quadratic in the actions or by discretizing the action space. However, the performance of the resulting policy declines when these premises do not hold. To address this problem, we propose a framework that combines swarm intelligence algorithms with value-based Reinforcement Learning, where the swarm intelligence algorithms search for the optimal action with respect to the state and the value function. To support the correctness of this framework, we give a conditional claim on the convergence rate of swarm intelligence algorithms that holds with high probability. We then implement the framework on the GPU platform, searching for the optimal actions of a batch of states in parallel for batch training. Furthermore, we employ population-based atomic actions for compatibility with existing work on discrete control problems. Four classical control models and four robot simulation environments are used in the comparisons. According to the empirical results, our framework outputs a policy comparable with that of the policy-based algorithms using 10% of the timesteps in continuous control.

Note to Practitioners: This paper is motivated by the exploration-exploitation dilemma of Reinforcement Learning in continuous control tasks. Stochastic exploration (e.g., $\varepsilon$-greedy) and prioritized exploration are roughly the two feasible ways to balance exploration and exploitation; prioritized exploration is the better choice because of its higher data efficiency. Normally, prioritized exploration works well in value-based Reinforcement Learning algorithms rather than policy-based ones, whereas policy-based algorithms are better suited to continuous control tasks than value-based ones. To resolve this conflict, we design a particle swarm optimization specifically to maximize the Q-value of the action in Q-learning. Our design can be hybridized with various swarm intelligence and value-based Reinforcement Learning algorithms, and it can be embedded easily in most intelligent control systems. The aim of this study is to solve continuous control tasks with value-based algorithms as a first step toward applying prioritized exploration. The simulation results verify the effectiveness and efficiency of our design.
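The core idea, using a particle swarm to search for the action that maximizes the Q-value for each state in a batch, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses NumPy on the CPU rather than a GPU, the names `toy_q` and `pso_argmax_q` are hypothetical, the Q-function is a hand-made multimodal stand-in for a learned network, and the swarm hyperparameters are common textbook defaults chosen only for illustration.

```python
import numpy as np

def toy_q(states, actions):
    """Hypothetical stand-in for a learned Q-network (assumption, not from the paper).
    Non-convex in the action, with its global maximum at a = 0.5 * sin(s)."""
    target = 0.5 * np.sin(states[:, :1])            # (N, 1), broadcast over action dims
    diff = actions - target                          # (N, action_dim)
    return np.sum(np.cos(3.0 * diff) - 0.1 * diff ** 2, axis=1)

def pso_argmax_q(q_fn, states, action_low, action_high,
                 n_particles=32, n_iters=50,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Run one independent swarm per state and return (best_actions, best_values)."""
    rng = np.random.default_rng(seed)
    batch, action_dim = states.shape[0], action_low.shape[0]

    # Particle positions are candidate actions: (batch, n_particles, action_dim).
    pos = rng.uniform(action_low, action_high, size=(batch, n_particles, action_dim))
    vel = np.zeros_like(pos)

    def evaluate(p):
        # Pair every particle with its state, evaluate Q in one flat batch.
        s = np.repeat(states[:, None, :], n_particles, axis=1)
        flat_q = q_fn(s.reshape(batch * n_particles, -1),
                      p.reshape(batch * n_particles, -1))
        return flat_q.reshape(batch, n_particles)

    pbest_pos, pbest_val = pos.copy(), evaluate(pos)
    rows = np.arange(batch)

    for _ in range(n_iters):
        # Best position found so far within each state's swarm.
        gbest_pos = pbest_pos[rows, pbest_val.argmax(axis=1)][:, None, :]
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = inertia * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = np.clip(pos + vel, action_low, action_high)  # keep actions in bounds

        val = evaluate(pos)
        improved = val > pbest_val
        pbest_pos = np.where(improved[..., None], pos, pbest_pos)
        pbest_val = np.where(improved, val, pbest_val)

    best = pbest_val.argmax(axis=1)
    return pbest_pos[rows, best], pbest_val[rows, best]

# Usage: greedy actions for a batch of four one-dimensional states.
states = np.array([[-1.0], [0.0], [0.7], [2.0]])
low, high = np.array([-2.0]), np.array([2.0])
actions, values = pso_argmax_q(toy_q, states, low, high)
```

In a Q-learning loop, the returned `values` would play the role of the maximum over actions in the bootstrapped target, and `actions` the role of the greedy action. Because each state has its own swarm and all particles are evaluated in one flat call, the same structure maps naturally onto batched evaluation on an accelerator.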

Implicit Posterior Sampling Reinforcement Learning for Continuous Control

Model-Based Robot Learning Control with Uncertainty Directed Exploration

Estimation and Control Using Sampling-Based Bayesian Reinforcement Learning

Continuous Control With Swarm Intelligence Based Value Function Approximation

Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation

Implicit Posteriori Parameter Distribution Optimization in Reinforcement Learning

Model-Assisted Reinforcement Learning with Adaptive Ensemble Value Expansion

Manifold Regularization Based Approximate Value Iteration For Learning Control

Posterior Sampling for Deep Reinforcement Learning

Inverse Policy Evaluation for Value-based Sequential Decision-making

A Priori Estimates for Deep Residual Network in Continuous-time Reinforcement Learning

Model predictive control-based value estimation for efficient reinforcement learning

Prior-dependent analysis of posterior sampling reinforcement learning with function approximation

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

PAC-Bayesian Randomized Value Function with Informative Prior

A Neural Network Approach for Stochastic Optimal Control

Enhanced Probabilistic Inference Algorithm Using Probabilistic Neural Networks For Learning Control

Parameter-Free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients

Value-Distributional Model-Based Reinforcement Learning

LiFE: Deep Exploration Via Linear-Feature Bonus in Continuous Control

Implicitly Regularized RL with Implicit Q-Values