Abstract: A reinforcement learning agent tries to maximize its cumulative payoff by interacting with an unknown environment. It is important for the agent to explore suboptimal actions as well as to pick the actions with the highest known rewards. Yet, in sensitive domains, collecting more data through exploration is not always possible, while it is still important to find a policy with a certain performance guarantee. In this paper, we present a brief survey of methods available in the literature for balancing the exploration-exploitation trade-off and computing robust solutions from fixed samples in reinforcement learning.

What problem does this paper attempt to address?

The core problem that this paper attempts to solve is **how to balance the exploration-exploitation trade-off in reinforcement learning and compute robust solutions from a limited number of samples**. Specifically, the paper focuses on:

1. **Balance between exploration and exploitation**: In an unknown environment, an agent needs to explore to discover potentially optimal actions, and at the same time it needs to exploit the actions currently known to be best in order to maximize cumulative reward. However, in sensitive fields (such as medicine and finance), data collection is constrained, so the agent cannot explore at will. It is therefore crucial to find methods that still guarantee performance with limited data.
2. **Computing robust solutions from a fixed sample**: In the Batch Reinforcement Learning (Batch RL) setting, an agent can only learn from a historical data set of interactions and cannot collect more data. A policy with certain performance guarantees still needs to be obtained in this case. For example, the new policy should perform no worse than the currently deployed policy, which reduces risk in practical applications.

To address these issues, the paper reviews several existing methods, including but not limited to:

- **Optimism in the Face of Uncertainty (OFU)**: Encourages the agent to try actions with high uncertainty but great potential.
- **Posterior Sampling for Reinforcement Learning (PSRL)**: Samples a possible Markov Decision Process (MDP) model from the posterior distribution and follows the policy that is optimal for the sampled model.
- **Robust optimization**: Maximizes the expected return in the worst-case scenario by constructing a Robust Markov Decision Process (RMDP) and solving the corresponding robust Bellman equation:

\[ V(s)=\max_{a\in A}\left(\hat{r}(s, a)+\min_{p\in P(s, a)}\left(\sum_{s'\in S}p(s')V(s')\right)\right) \]

These methods aim to ensure that the agent can make more robust and reliable decisions when facing uncertainty and limited data.
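As an illustration of the robust Bellman equation above, the following is a minimal sketch of robust value iteration, not the paper's implementation. It rests on two assumptions that are not in the summary: the ambiguity set P(s, a) is taken to be a finite list of candidate transition distributions (so the inner minimum is a plain minimum over candidates), and a discount factor `gamma` is added so that the iteration converges in the infinite-horizon case. The function name `robust_value_iteration` and the example numbers are hypothetical.

```python
import numpy as np


def robust_value_iteration(r_hat, candidates, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Robust value iteration over a finite ambiguity set (illustrative sketch).

    r_hat:      array of shape (S, A), estimated rewards r_hat(s, a).
    candidates: array of shape (K, S, A, S); candidates[k, s, a] is the k-th
                candidate distribution over next states, i.e. one element of
                P(s, a). Assumption: P(s, a) is this finite set.
    gamma:      discount factor (assumption; the equation in the text has none).
    """
    n_states, _ = r_hat.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        # Expected next-state value under every candidate model: shape (K, S, A).
        expected = candidates @ v
        # Inner minimization: worst candidate for each (s, a) pair.
        worst_case = expected.min(axis=0)
        # Outer maximization over actions.
        q = r_hat + gamma * worst_case
        v_new = q.max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    return v, q.argmax(axis=1)


# Hypothetical 2-state, 2-action problem with two candidate transition models.
r_hat = np.array([[1.0, 0.0],
                  [0.0, 0.5]])
candidates = np.array([
    [[[0.9, 0.1], [0.5, 0.5]], [[0.2, 0.8], [0.0, 1.0]]],
    [[[0.6, 0.4], [0.5, 0.5]], [[0.4, 0.6], [0.0, 1.0]]],
])
v, policy = robust_value_iteration(r_hat, candidates)
print("robust values:", v)
print("robust policy:", policy)
```

Because the minimum is taken independently for each state-action pair, the resulting value is never higher than the value computed under any single candidate model, which is the sense in which the solution is conservative.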

A Short Survey on Probabilistic Reinforcement Learning

A Survey of Exploration Methods in Reinforcement Learning

Bayesian Reinforcement Learning: A Survey

Reinforcement Learning with Probabilistically Complete Exploration

Towards Uncertainty in Decision: A Survey on Recent Advances and Challenges in Bayesian Reinforcement Learning

A Survey on Explainable Reinforcement Learning: Concepts, Algorithms, Challenges

A Survey of Reinforcement Learning Algorithms for Dynamically Varying Environments

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

Beyond Optimism: Exploration With Partially Observable Rewards

Reinforcement Learning Algorithms: A brief survey

Reward Uncertainty for Exploration in Preference-based Reinforcement Learning

Dealing with uncertainty: Balancing exploration and exploitation in deep recurrent reinforcement learning

Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain

State-wise Safe Reinforcement Learning: A Survey

Hierarchical Reinforcement Learning: A Survey and Open Research Challenges

A Survey On Enhancing Reinforcement Learning in Complex Environments: Insights from Human and LLM Feedback

A Survey of Reinforcement Learning Techniques: Strategies, Recent Development, and Future Directions

A survey of benchmarking frameworks for reinforcement learning

Survey of Self-Play in Reinforcement Learning

Evolutionary Reinforcement Learning: A Survey