A Short Survey on Probabilistic Reinforcement Learning

Reazul Hasan Russel
DOI: https://doi.org/10.48550/arXiv.1901.07010
2019-01-22
Abstract: A reinforcement learning agent tries to maximize its cumulative payoff by interacting with an unknown environment. It is important for the agent to explore suboptimal actions as well as to pick actions with the highest known rewards. Yet, in sensitive domains, collecting more data with exploration is not always possible, but it is important to find a policy with a certain performance guarantee. In this paper, we present a brief survey of methods available in the literature for balancing the exploration-exploitation trade-off and computing robust solutions from fixed samples in reinforcement learning.
Machine Learning
What problem does this paper attempt to address?
The core problem this paper addresses is **how to balance the exploration-exploitation trade-off in reinforcement learning and compute robust solutions from a limited number of samples**. Specifically, the paper focuses on:

1. **Balance between exploration and exploitation**: In an unknown environment, an agent must explore to discover potentially better actions while also exploiting the actions it already knows to be good, so as to maximize cumulative reward. In sensitive domains (such as medicine or finance), data collection is constrained, so the agent cannot explore at will. It is therefore crucial to find methods that ensure performance with limited data.
2. **Computing robust solutions from a fixed sample**: In the batch reinforcement learning (batch RL) setting, an agent learns only from a dataset of historical interactions and cannot collect more data. A policy with some performance guarantee is still needed, for example one whose performance is no lower than that of the currently deployed policy, which reduces risk in practical applications.

To address these issues, the paper reviews several existing methods, including but not limited to:

- **Optimism in the Face of Uncertainty (OFU)**: Encourages the agent to try actions whose outcomes are highly uncertain but whose potential value is high.
- **Posterior Sampling for Reinforcement Learning (PSRL)**: Samples a plausible Markov Decision Process (MDP) model from the posterior distribution and acts according to the policy that is optimal for the sampled model.
- **Robust optimization**: Maximizes the expected return under the worst-case scenario by constructing a Robust Markov Decision Process (RMDP) and solving the corresponding robust Bellman equation:

\[
V(s) = \max_{a \in A} \left( \hat{r}(s, a) + \min_{p \in P(s, a)} \sum_{s' \in S} p(s') V(s') \right)
\]

These methods aim to help the agent make more robust and reliable decisions under uncertainty and with limited data.
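To make the robust Bellman equation above concrete, the following is a minimal sketch of robust value iteration, not the paper's own implementation. It rests on two assumptions that the summary does not state. First, the ambiguity set P(s, a) is represented as a small finite list of candidate next-state distributions, so the inner minimum is a plain minimum over that list. Second, a discount factor `gamma < 1` is added so that the iteration converges; with `gamma = 1` each backup matches the equation as written. The function name, array layout, and example numbers are all hypothetical.

```python
import numpy as np


def robust_value_iteration(r_hat, P_candidates, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate the robust Bellman backup until the values stop changing.

    r_hat:        array (S, A), estimated reward for each state-action pair.
    P_candidates: array (S, A, K, S); for each (s, a), K candidate
                  next-state distributions standing in for P(s, a)
                  (an assumed, finite ambiguity set).
    gamma:        discount factor (an assumption added for convergence).
    """
    n_states, _ = r_hat.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Expected next-state value under every candidate model: shape (S, A, K).
        expected = P_candidates @ V
        # Inner min: the worst candidate model for each (s, a).
        worst_case = expected.min(axis=2)
        # Outer max: the best action against that worst case.
        V_new = (r_hat + gamma * worst_case).max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy robust policy with respect to the final values.
    Q = r_hat + gamma * (P_candidates @ V).min(axis=2)
    return V, Q.argmax(axis=1)


# Hypothetical 2-state, 2-action problem with K = 2 candidate models per (s, a).
r_hat = np.array([[0.0, 0.5],
                  [1.0, 1.0]])
P_candidates = np.zeros((2, 2, 2, 2))
P_candidates[0, 0] = [[0.1, 0.9], [0.6, 0.4]]
P_candidates[0, 1] = [[1.0, 0.0], [0.9, 0.1]]
P_candidates[1, 0] = [[0.0, 1.0], [0.2, 0.8]]
P_candidates[1, 1] = [[0.5, 0.5], [0.5, 0.5]]

V_robust, policy = robust_value_iteration(r_hat, P_candidates)
# Keeping only the first candidate gives an ordinary (non-robust) MDP for comparison.
V_nominal, _ = robust_value_iteration(r_hat, P_candidates[:, :, :1, :])
print("robust values: ", V_robust)
print("nominal values:", V_nominal)
print("robust policy: ", policy)
```

Because the robust backup minimizes over a superset of the nominal model, the robust values can never exceed the nominal ones. That gap is the price paid for the worst-case guarantee. Richer ambiguity sets (for example, norm balls around an estimated transition model) would replace the `min(axis=2)` step with a small optimization problem per state-action pair.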