Bayesian Risk-Averse Q-Learning with Streaming Observations

Yuhao Wang, Enlu Zhou
2023-05-19
Abstract: We consider a robust reinforcement learning problem, where a learning agent learns from a simulated training environment. To account for the model mis-specification between this training environment and the real environment due to lack of data, we adopt a formulation of Bayesian risk MDP (BRMDP) with infinite horizon, which uses the Bayesian posterior to estimate the transition model and imposes a risk functional to account for the model uncertainty. Observations from the real environment, which is out of the agent's control, arrive periodically and are utilized by the agent to update the Bayesian posterior to reduce model uncertainty. We theoretically demonstrate that BRMDP balances the trade-off between robustness and conservativeness, and we further develop a multi-stage Bayesian risk-averse Q-learning algorithm to solve BRMDP with streaming observations from the real environment. The proposed algorithm learns a risk-averse yet optimal policy that depends on the availability of real-world observations. We provide a theoretical guarantee of strong convergence for the proposed algorithm.
Machine Learning
What problem does this paper attempt to address?
The paper addresses a common failure mode in reinforcement learning: a policy learned in a training environment performs worse in the real world because the training model does not match the real environment. Specifically, it asks how to learn an optimal policy that is robust to model uncertainty without being overly conservative, and it does so through a Bayesian Risk Markov Decision Process (BRMDP). To solve the BRMDP, the paper proposes a multi-stage Bayesian risk-averse Q-learning algorithm that updates the Bayesian posterior as streaming observations arrive from the real environment, thereby reducing model uncertainty and learning a risk-averse optimal policy that depends on the observations available.

The main contributions of the paper include:

1. **Theoretical Contribution**: Proposes an infinite-horizon Bayesian Risk Markov Decision Process (BRMDP) and proves that BRMDP balances robustness against conservativeness.
2. **Algorithmic Contribution**: Develops a multi-stage Bayesian risk-averse Q-learning algorithm that handles streaming observations from the real environment, gradually reducing model uncertainty.
3. **Convergence Guarantee**: Provides a theoretical guarantee of strong convergence for the proposed algorithm, so that the learned Q-function asymptotically approaches the optimal Q-function.
4. **Experimental Validation**: Validates the method through numerical experiments, reporting that it can outperform existing distributionally robust Q-learning methods, especially in scenarios with high model uncertainty.

In summary, the paper aims to improve the robustness and adaptability of reinforcement learning under model uncertainty by combining Bayesian posterior updating with a risk-averse objective.
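The two ingredients described above, a posterior over the transition model that is updated from real observations and a risk functional applied over that posterior, can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a tabular MDP, a Dirichlet posterior over each transition row, and a lower-tail CVaR as the risk functional, and it uses a sample-based full backup instead of a stochastic Q-learning update. All function and parameter names (`dirichlet_posterior_update`, `lower_cvar`, `risk_averse_backup`, `level`, `n_samples`) are hypothetical.

```python
import numpy as np


def dirichlet_posterior_update(alpha, transitions):
    """Add observed (s, a, s_next) counts to Dirichlet parameters of shape (S, A, S).

    Assumption: each transition row has an independent Dirichlet posterior.
    """
    alpha = alpha.copy()
    for s, a, s_next in transitions:
        alpha[s, a, s_next] += 1.0
    return alpha


def lower_cvar(values, level):
    """Mean of the worst (smallest) `level` fraction of the values.

    Assumption: rewards are maximized, so the lower tail is the risky one.
    level=1.0 recovers the plain mean (risk-neutral).
    """
    values = np.sort(np.asarray(values, dtype=float))
    k = max(1, int(np.ceil(level * values.size)))
    return values[:k].mean()


def risk_averse_backup(Q, alpha, reward, gamma, level, n_samples, rng):
    """One risk-averse Bellman backup over all state-action pairs.

    For each (s, a), sample transition rows from the posterior, compute the
    expected one-step target under each sampled row, then apply the risk
    functional across the samples. `reward` has shape (S, A, S).
    """
    n_states, n_actions, _ = alpha.shape
    V = Q.max(axis=1)  # greedy state values under the current Q
    Q_new = np.empty_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            P = rng.dirichlet(alpha[s, a], size=n_samples)  # (n_samples, S)
            targets = P @ (reward[s, a] + gamma * V)        # (n_samples,)
            Q_new[s, a] = lower_cvar(targets, level)
    return Q_new
```

In a multi-stage loop under these assumptions, each new batch of real-environment transitions would first pass through `dirichlet_posterior_update`, after which `risk_averse_backup` would be iterated with the refreshed `alpha`. As the counts grow, the sampled rows concentrate and the risk-averse values move toward the risk-neutral ones, which mirrors the robustness versus conservativeness trade-off the summary describes.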