Abstract: Model-Based Reinforcement Learning (MBRL) is increasingly applied in robot learning because of its sample efficiency and asymptotic performance. However, for high-dimensional learning tasks in complex scenes, the robot's exploration ability and training stability still need improvement. From the perspectives of policy planning and policy optimization, we propose a bidirectional model-based policy optimization algorithm based on adaptive Gaussian noise and improved confidence weights (BMPO-NW). The algorithm parameterizes the bidirectional policy networks as noisy networks by adding different adaptive Gaussian noise to the connection weights and biases. This increases the randomness of the policy search and induces efficient exploration by the robot. In addition, a confidence weight with an improved activation function is introduced into the Q-function update formula of SAC, which reduces error propagation from the target Q-network and improves the robot’s training stability. Finally, we implement the improved algorithm within the framework of the bidirectional model-based policy optimization algorithm (BMPO) to preserve its asymptotic performance and sample efficiency. Experimental results in MuJoCo benchmark environments show that BMPO-NW learns about 20% faster than the baseline methods, that its average reward is about 15% higher than that of other MBRL methods and 50%-70% higher than that of MFRL methods, and that its training process is more stable. Ablation experiments and experiments with different variant designs further verify the feasibility and robustness of the algorithm. These results support the conclusions of this paper and have practical value for applying MBRL to robots in complex scenarios.
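
To make the noisy-network idea in the abstract concrete, the following is a minimal sketch of a linear layer whose weights and biases are perturbed by Gaussian noise. It is not the BMPO-NW implementation: the class name `NoisyLinear`, the parameter `sigma_init`, the initialization bounds, and the choice of independent per-parameter noise are all assumptions made for illustration, and the abstract does not specify how the noise scale is adapted. In a trainable version, both the mean and the noise scale would be learned, which is one common way to make the noise adaptive.

```python
import random


class NoisyLinear:
    """Hypothetical linear layer with Gaussian noise on weights and biases.

    Effective parameters are w = mu_w + sigma_w * eps_w and
    b = mu_b + sigma_b * eps_b, with eps drawn from N(0, 1).
    Only the forward pass is sketched; training mu and sigma is omitted.
    """

    def __init__(self, n_in, n_out, sigma_init=0.5, seed=0):
        self.rng = random.Random(seed)
        bound = 1.0 / n_in ** 0.5  # assumed initialization scale
        self.mu_w = [[self.rng.uniform(-bound, bound) for _ in range(n_in)]
                     for _ in range(n_out)]
        self.mu_b = [self.rng.uniform(-bound, bound) for _ in range(n_out)]
        # Noise scales; these would be learnable in a real policy network.
        self.sigma_w = [[sigma_init * bound] * n_in for _ in range(n_out)]
        self.sigma_b = [sigma_init * bound] * n_out
        # Noise samples start at zero until resample_noise() is called.
        self.eps_w = [[0.0] * n_in for _ in range(n_out)]
        self.eps_b = [0.0] * n_out

    def resample_noise(self):
        """Draw fresh standard-normal noise for every weight and bias."""
        self.eps_w = [[self.rng.gauss(0.0, 1.0) for _ in row]
                      for row in self.mu_w]
        self.eps_b = [self.rng.gauss(0.0, 1.0) for _ in self.mu_b]

    def forward(self, x, noisy=True):
        """Return the layer output; noisy=False uses only the mean parameters."""
        scale = 1.0 if noisy else 0.0
        out = []
        for j in range(len(self.mu_b)):
            w = [m + scale * s * e for m, s, e in
                 zip(self.mu_w[j], self.sigma_w[j], self.eps_w[j])]
            b = self.mu_b[j] + scale * self.sigma_b[j] * self.eps_b[j]
            out.append(sum(wi * xi for wi, xi in zip(w, x)) + b)
        return out


layer = NoisyLinear(n_in=3, n_out=2)
x = [1.0, -2.0, 0.5]
deterministic = layer.forward(x, noisy=False)
layer.resample_noise()
first_noisy = layer.forward(x)
layer.resample_noise()
second_noisy = layer.forward(x)
```

Resampling the noise before each rollout gives a different perturbed policy for the same input, which is the source of the extra exploration; evaluating with `noisy=False` recovers the deterministic mean policy.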

Simulation Optimization of Actions of Robot Based on POMDP Model

Model-Based Robot Learning Control with Uncertainty Directed Exploration

Safe Sim-to-Real Robot Exploration with Constrained Bayesian Optimization

Policy Optimization with Model-based Explorations

Model Predictive Optimization for Imitation Learning from Demonstrations

Simulation of Robotic Arm Grasping Control Based on Proximal Policy Optimization Algorithm

Proximal policy optimization via enhanced exploration efficiency

Policy Graph Pruning and Optimization in Monte Carlo Value Iteration for Continuous-State POMDPs

Finding Optimal Memoryless Policies of POMDPs under the Expected Average Reward Criterion

Joint action loss for proximal policy optimization

Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations

Proximal Policy Optimization with Future Rewards

Demonstration-Based Proximal Policy Optimization with Action Guidance

Observation-Based Optimization for POMDPs with Continuous State, Observation, and Action Spaces

Bidirectional Model-Based Policy Optimization Based on Adaptive Gaussian Noise and Improved Confidence Weights

Model-Based Reinforcement Learning via Proximal Policy Optimization

A Probability-Based Value Iteration on Optimal Policy Algorithm for POMDP

Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces

Reward-Adaptive Reinforcement Learning: Dynamic Policy Gradient Optimization for Bipedal Locomotion

Hybrid and dynamic policy gradient optimization for bipedal robot locomotion

Simulation-Aided Policy Tuning for Black-Box Robot Learning