Abstract: We consider a robust reinforcement learning problem in which a learning agent learns from a simulated training environment. To account for the model mis-specification between this training environment and the real environment due to lack of data, we adopt a formulation of Bayesian risk MDP (BRMDP) with infinite horizon, which uses a Bayesian posterior to estimate the transition model and imposes a risk functional to account for the model uncertainty. Observations from the real environment, which is out of the agent's control, arrive periodically and are used by the agent to update the Bayesian posterior and reduce model uncertainty. We theoretically demonstrate that BRMDP balances the trade-off between robustness and conservativeness, and we further develop a multi-stage Bayesian risk-averse Q-learning algorithm to solve BRMDP with streaming observations from the real environment. The proposed algorithm learns a risk-averse yet optimal policy that depends on the availability of real-world observations. We provide a theoretical guarantee of strong convergence for the proposed algorithm.

What problem does this paper attempt to address?

The paper attempts to address the issue in reinforcement learning where the performance of the learned policy degrades in real-world applications due to model mismatch between the training environment and the real environment. Specifically, the paper focuses on how to learn an optimal policy that is both robust and not overly conservative in the presence of model uncertainty by employing a Bayesian Risk Markov Decision Process (BRMDP). To solve this problem, the paper proposes a multi-stage Bayesian risk-averse Q-learning algorithm that can continuously update the Bayesian posterior distribution using real-time observational data from the real environment, thereby reducing model uncertainty and ultimately learning a risk-averse optimal policy that depends on actual observational data.

The main contributions of the paper include:

1. **Theoretical Contribution**: Proposes an infinite-horizon Bayesian Risk Markov Decision Process (BRMDP) and proves that BRMDP can achieve a balance between robustness and conservatism.
2. **Algorithmic Contribution**: Develops a multi-stage Bayesian risk-averse Q-learning algorithm that can handle streaming observational data from the real environment, gradually reducing model uncertainty.
3. **Convergence Guarantee**: Provides strong theoretical convergence guarantees for the proposed algorithm, ensuring that the learned Q-function can asymptotically approach the optimal Q-function.
4. **Experimental Validation**: Validates the effectiveness of the proposed method through numerical experiments, demonstrating that the proposed method can outperform existing distributionally robust Q-learning methods, especially in scenarios with high model uncertainty.

In summary, the paper aims to improve the robustness and adaptability of reinforcement learning in the face of model uncertainty by combining Bayesian methods and risk-averse strategies.
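The loop described above (update a Bayesian posterior as real-world observations arrive, then plan risk-aversely under that posterior) can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a tabular MDP, a Dirichlet posterior over each transition row, a lower-tail CVaR as the risk functional, and a model-based Bellman sweep in place of the sample-based Q-learning update. All function names and parameters (`update_posterior`, `lower_cvar`, `risk_averse_backup`, `run_stages`, `level`, `sweeps`) are hypothetical and chosen only for illustration.

```python
import random


def update_posterior(counts, observations):
    """Add observed (state, action, next_state) transitions to Dirichlet counts."""
    for s, a, s_next in observations:
        counts[s][a][s_next] += 1.0
    return counts


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def lower_cvar(values, level):
    """Average of the worst `level` fraction of values (always at least one value)."""
    k = max(1, int(level * len(values)))
    return sum(sorted(values)[:k]) / k


def risk_averse_backup(q, counts, reward, gamma, level, n_samples, rng):
    """One sweep over all (state, action) pairs under posterior uncertainty."""
    n_states, n_actions = len(q), len(q[0])
    v = [max(row) for row in q]  # greedy state values from the current Q
    new_q = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            targets = []
            for _ in range(n_samples):
                # Assumption: one posterior draw of the transition row for (s, a).
                p = sample_dirichlet(counts[s][a], rng)
                targets.append(
                    sum(p[j] * (reward[s][a][j] + gamma * v[j]) for j in range(n_states))
                )
            # Assumption: CVaR over posterior draws stands in for the risk functional.
            new_q[s][a] = lower_cvar(targets, level)
    return new_q


def run_stages(batches, n_states, n_actions, reward,
               gamma=0.9, level=0.2, sweeps=50, n_samples=30, seed=0):
    """Each stage: absorb a batch of real observations, then re-plan."""
    rng = random.Random(seed)
    # Uniform Dirichlet(1, ..., 1) prior on every transition row (an assumption).
    counts = [[[1.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    q = [[0.0] * n_actions for _ in range(n_states)]
    for batch in batches:
        update_posterior(counts, batch)
        for _ in range(sweeps):
            q = risk_averse_backup(q, counts, reward, gamma, level, n_samples, rng)
    return q, counts
```

In this sketch, a smaller `level` averages over fewer and worse posterior draws, so the resulting Q-values are more pessimistic. As the counts grow with each batch, the Dirichlet draws concentrate and the CVaR target moves toward the plain expected target, which mirrors the robustness versus conservativeness trade-off the summary describes.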

Bayesian Risk-Averse Q-Learning with Streaming Observations

Look Before You Leap: Safe Model-Based Reinforcement Learning with Human Intervention

Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty

A Bayesian Approach to Learning Bandit Structure in Markov Decision Processes

Bayesian Stochastic Gradient Descent for Stochastic Optimization with Streaming Input Data

Model and Reinforcement Learning for Markov Games with Risk Preferences

Risk-Averse Bayes-Adaptive Reinforcement Learning

Context-Aware Safe Reinforcement Learning for Non-Stationary Environments

Adaptive Deep Reinforcement Learning for Non-Stationary Environments

A new online learning algorithm for streaming data and decision support with a Bayesian approach

State-Wise Safe Reinforcement Learning With Pixel Observations

Risk-Averse Reinforcement Learning via Dynamic Time-Consistent Risk Measures

Improving Robustness via Risk Averse Distributional Reinforcement Learning

Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning

A Bayesian Approach to Robust Inverse Reinforcement Learning

Approximating Pareto Frontier Through Bayesian-optimization-directed Robust Multi-objective Reinforcement Learning

Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis

Robust $Q$-learning Algorithm for Markov Decision Processes under Wasserstein Uncertainty

Distributional Method for Risk Averse Reinforcement Learning

Risk-Averse Control of Markov Systems with Value Function Learning