Abstract: Model-based reinforcement learning (MBRL) has been increasingly applied to robot learning because of its high sample efficiency and strong asymptotic performance. However, for high-dimensional learning tasks in complex scenes, the robot's exploration ability and training stability still need improvement. Addressing both policy planning and policy optimization, we propose a bidirectional model-based policy optimization algorithm based on adaptive Gaussian noise and improved confidence weights (BMPO-NW). The algorithm turns the bidirectional policy networks into noise networks by adding different adaptive Gaussian noises to their connection weights and biases. This increases the randomness of policy search and induces more efficient exploration by the robot. In addition, a confidence weight built on an improved activation function is introduced into the Q-function update of Soft Actor-Critic (SAC). This reduces the propagation of errors from the target Q-network and improves the robot's training stability. Finally, the improved algorithm is implemented within the framework of bidirectional model-based policy optimization (BMPO) to retain its asymptotic performance and sample efficiency. Experiments in MuJoCo benchmark environments show three results. BMPO-NW learns about 20% faster than the baseline methods. Its average reward is about 15% higher than that of other MBRL methods and 50-70% higher than that of model-free RL (MFRL) methods. Its training process is also more stable. Ablation experiments and experiments on different variant designs further verify the feasibility and robustness of the method. These results support the conclusions of this paper and are of practical value for applying MBRL to robots in complex scenarios.
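The two mechanisms described above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. `NoisyLinear`, `confidence_weight`, `weighted_critic_loss` and the parameters `sigma0`, `temperature`, `gamma` and `alpha` are hypothetical names.

The abstract gives neither the noise-adaptation rule nor the exact form of the "improved activation function", so the sketch makes two assumptions:

- The noise uses NoisyNet-style per-parameter noise scales, which would be trained alongside the weights (training not shown).
- The confidence weight is replaced by the sigmoid-based weight from SUNRISE's weighted Bellman backup.

Only a single layer is shown. Per the abstract, such layers would be used in both of the bidirectional policy networks.

```python
import math
import random


class NoisyLinear:
    """Hypothetical noisy layer: y = (mu_w + sigma_w*eps_w) x + (mu_b + sigma_b*eps_b).

    sigma_w and sigma_b are per-parameter noise scales. In a trainable version they
    would be updated by gradient descent together with mu_w and mu_b, which is one
    way to make the noise "adaptive" (assumption; training is not shown here).
    """

    def __init__(self, n_in, n_out, sigma0=0.017, seed=0):
        self.rng = random.Random(seed)
        bound = math.sqrt(3.0 / n_in)  # illustrative initialisation range for the means
        self.mu_w = [[self.rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
        self.sigma_w = [[sigma0] * n_in for _ in range(n_out)]
        self.mu_b = [self.rng.uniform(-bound, bound) for _ in range(n_out)]
        self.sigma_b = [sigma0] * n_out
        self.resample()

    def resample(self):
        """Draw independent eps ~ N(0, 1) for every weight and every bias."""
        self.eps_w = [[self.rng.gauss(0.0, 1.0) for _ in row] for row in self.mu_w]
        self.eps_b = [self.rng.gauss(0.0, 1.0) for _ in self.mu_b]

    def forward(self, x, noisy=True):
        """Affine map of the input list x; noisy=False uses only the mean parameters."""
        scale = 1.0 if noisy else 0.0
        out = []
        for i, row in enumerate(self.mu_w):
            acc = self.mu_b[i] + scale * self.sigma_b[i] * self.eps_b[i]
            for j, xj in enumerate(x):
                acc += (row[j] + scale * self.sigma_w[i][j] * self.eps_w[i][j]) * xj
            out.append(acc)
        return out


def confidence_weight(q_std, temperature=10.0):
    """SUNRISE-style weight sigmoid(-q_std * T) + 0.5, between 0.5 and 1.0.

    Stand-in for the paper's "improved activation function" (assumption).
    A large spread of the target-Q ensemble (q_std) shrinks the weight toward 0.5.
    """
    z = min(q_std * temperature, 700.0)  # cap the exponent to avoid OverflowError
    return 1.0 / (1.0 + math.exp(z)) + 0.5


def weighted_critic_loss(q_pred, rewards, next_q_min, next_logp, dones, q_stds,
                         gamma=0.99, alpha=0.2):
    """Confidence-weighted SAC critic loss, averaged over a batch.

    Soft target: y = r + gamma * (1 - done) * (min_k Q'_k(s', a') - alpha * log pi(a'|s')).
    Each squared TD error is scaled by the confidence weight of its target.
    """
    total = 0.0
    for q, r, nq, lp, d, s in zip(q_pred, rewards, next_q_min, next_logp, dones, q_stds):
        target = r + gamma * (1.0 - d) * (nq - alpha * lp)
        total += confidence_weight(s) * (q - target) ** 2
    return total / len(q_pred)


# Usage: one noisy forward pass for exploration, then fresh noise for the next rollout.
layer = NoisyLinear(n_in=3, n_out=2)
hidden = layer.forward([0.1, -0.4, 0.7])
layer.resample()
loss = weighted_critic_loss([1.2], [1.0], [0.5], [-1.0], [0.0], [0.3])
```

In this sketch the confidence weight only rescales each sample's squared TD error. Transitions whose target Q-values disagree strongly across the ensemble therefore contribute less to the critic update. This is the usual intuition for why such weights can limit error propagation from the target network.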

Bidirectional Model-Based Policy Optimization Based on Adaptive Gaussian Noise and Improved Confidence Weights
