Abstract: Model-Based Reinforcement Learning (MBRL) has gradually been applied in robot learning because of its sample efficiency and asymptotic performance. However, for high-dimensional learning tasks in complex scenes, the robot's exploration ability and training stability still need improvement. From the perspectives of policy planning and policy optimization, we propose a bidirectional model-based policy optimization algorithm based on adaptive Gaussian noise and improved confidence weights (BMPO-NW). The algorithm turns the bidirectional policy networks into noisy networks by adding different adaptive Gaussian noise terms to the connection weights and biases. This increases the randomness of the policy search and induces efficient exploration by the robot. In addition, a confidence weight with an improved activation function is introduced into the Q-function update formula of SAC, which reduces error propagation from the target Q-network and improves the robot's training stability. Finally, we implement the improved algorithm within the framework of the bidirectional model-based policy optimization algorithm (BMPO) to preserve asymptotic performance and sample efficiency. Experimental results in MuJoCo benchmark environments show that the learning speed of BMPO-NW is about 20% higher than that of the baseline methods, and that its average reward is about 15% higher than that of other MBRL methods and 50%-70% higher than that of MFRL methods, while the training process is more stable. Ablation experiments and experiments with different variant designs further verify the feasibility and robustness of the method. These results support the conclusions of this paper and have significant practical value for helping MBRL bring robots into complex application scenarios.
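The noise injection described in the abstract can be illustrated with a minimal sketch of a single noisy linear layer. This is not the authors' implementation: the class name `NoisyLinear`, the initial scale `sigma_init`, and the use of independent Gaussian noise per parameter are assumptions made for illustration. In a full algorithm the noise scales would presumably be learned together with the mean weights, which is what would make the noise adaptive; the sketch only shows the forward pass.

```python
import math
import random


class NoisyLinear:
    """Hypothetical noisy layer: y = (W_mu + W_sigma * eps_W) x + (b_mu + b_sigma * eps_b).

    eps_W and eps_b are fresh standard Gaussian samples on every noisy forward pass.
    All names and initial values are illustrative assumptions, not taken from the paper.
    """

    def __init__(self, n_in, n_out, sigma_init=0.5, seed=0):
        self.rng = random.Random(seed)
        self.n_in, self.n_out = n_in, n_out
        bound = 1.0 / math.sqrt(n_in)
        # Mean parameters (would be trained by gradient descent).
        self.w_mu = [[self.rng.uniform(-bound, bound) for _ in range(n_in)]
                     for _ in range(n_out)]
        self.b_mu = [self.rng.uniform(-bound, bound) for _ in range(n_out)]
        # Noise scales (would also be trained, making the noise adaptive).
        self.w_sigma = [[sigma_init * bound] * n_in for _ in range(n_out)]
        self.b_sigma = [sigma_init * bound] * n_out

    def forward(self, x, noisy=True):
        """Return the layer output; with noisy=False only the means are used."""
        out = []
        for i in range(self.n_out):
            acc = self.b_mu[i]
            if noisy:
                acc += self.b_sigma[i] * self.rng.gauss(0.0, 1.0)
            for j in range(self.n_in):
                w = self.w_mu[i][j]
                if noisy:
                    w += self.w_sigma[i][j] * self.rng.gauss(0.0, 1.0)
                acc += w * x[j]
            out.append(acc)
        return out


layer = NoisyLinear(n_in=3, n_out=2)
x = [1.0, -2.0, 0.5]
mean_out = layer.forward(x, noisy=False)   # deterministic output
sample_a = layer.forward(x)                # perturbed output
sample_b = layer.forward(x)                # a different perturbation
```

Calling `forward` repeatedly with `noisy=True` yields different outputs for the same input, which is the source of the extra randomness in policy search; setting the noise scales to zero recovers an ordinary linear layer.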

Training Reinforcement Neurocontrollers Using the Polytope Algorithm

PolyNet: Learning Diverse Solution Strategies for Neural Combinatorial Optimization

Training Efficient Controllers via Analytic Policy Gradient

Neural Combinatorial Optimization: a New Player in the Field

Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards

Principled Deep Neural Network Training through Linear Programming

PLATO: Policy Learning using Adaptive Trajectory Optimization

Conformal Symplectic Optimization for Stable Reinforcement Learning

Proximal Policy Optimization Algorithms

Robust Optimization through Neuroevolution

Optimal Learning Output Tracking Control: A Model-Free Policy Optimization Method With Convergence Analysis

A New Optimization Model for MLP Hyperparameter Tuning: Modeling and Resolution by Real-Coded Genetic Algorithm

Imagined Value Gradients: Model-Based Policy Optimization with Transferable Latent Dynamics Models

Self-Improvement for Neural Combinatorial Optimization: Sample without Replacement, but Improvement

Clipped-Objective Policy Gradients for Pessimistic Policy Optimization

Improving Policy Optimization via $\varepsilon$-Retrain

Bidirectional Model-Based Policy Optimization Based on Adaptive Gaussian Noise and Improved Confidence Weights

Beyond the Boundaries of Proximal Policy Optimization

Evolving Genes to Balance a Pole

POLAR: Preference Optimization and Learning Algorithms for Robotics

Discovered Policy Optimisation