Strategy Synthesis in Markov Decision Processes Under Limited Sampling Access

Christel Baier, Clemens Dubslaff, Patrick Wienhöft, Stefan J. Kiebel
2023-04-24
Abstract: A central task in control theory, artificial intelligence, and formal methods is to synthesize reward-maximizing strategies for agents that operate in partially unknown environments. In environments modeled by gray-box Markov decision processes (MDPs), the impact of the agents' actions is known in terms of successor states but not the stochastics involved. In this paper, we devise a strategy synthesis algorithm for gray-box MDPs via reinforcement learning that utilizes interval MDPs as an internal model. To compete with limited sampling access in reinforcement learning, we incorporate two novel concepts into our algorithm, focusing on rapid and successful learning rather than on stochastic guarantees and optimality: lower confidence bound exploration reinforces variants of already learned practical strategies and action scoping reduces the learning action space to promising actions. We illustrate benefits of our algorithms by means of a prototypical implementation applied on examples from the AI and formal methods communities.
Machine Learning
What problem does this paper attempt to address?
The core problem this paper addresses is how to quickly synthesize a near-optimal, reward-maximizing policy in a partially unknown environment when only limited sampling access is available. Specifically, the environment is modeled as a gray-box Markov decision process (MDP): the agent knows which successor states each action can lead to, but not the transition probabilities. The paper studies how a reinforcement learning (RL) algorithm can determine a policy with high expected cumulative reward in this setting.

### Main Problems and Solutions

1. **Environment Model and Challenges**
   - The environment is modeled as a gray-box MDP, where the agent knows the impact of actions (i.e., the successor states) but not the probabilities involved.
   - Traditional RL methods focus on convergence to globally optimal policies with stochastic guarantees, which makes learning slow and impractical when only a small number of samples is available.
2. **Proposed Solutions**
   - Two new concepts are introduced to enable rapid learning under limited sampling access:
     - **Lower Confidence Bound (LCB) Exploration**: Reinforce variants of practical policies that have already been learned, and reduce the exploration of uncertain paths.
     - **Action Scoping**: Narrow the learning action space to promising actions, thereby speeding up policy synthesis.
3. **Algorithm Design**
   - An RL algorithm that uses an interval MDP (IMDP) as its internal model is proposed.
   - By iteratively updating the intervals in the IMDP and combining LCB exploration with action scoping, the algorithm can quickly find a "good" policy under limited sampling.

### Formula Representation

- **Definition of IMDP**
  \[ U=(S, A, \imath, G, R, \hat{T}) \]
  where \( S \) is the set of states, \( A \) is the set of actions, \( \imath \) is the initial state, \( G\subseteq S \) is the set of target states, \( R: S\rightarrow\mathbb{R} \) is the reward function, and \( \hat{T}: S\times A\rightarrow\text{Intv}(S) \) is the interval transition function.
- **Pessimistic and Optimistic Value Functions**
  \[ V_U(s)=\min_{M\in[U]} V_M(s),\quad V^U(s)=\max_{M\in[U]} V_M(s) \]
  where \( V_M(s) \) is the value function of MDP \( M \) and \( [U] \) is the set of MDPs whose transition probabilities lie within the intervals of \( U \).
- **Quality Function**
  \[ Q(s, a)=R(s)+\sum_{s'\in\text{Post}(s, a)} V(s')\cdot T(s, a, s') \]
  where \( \text{Post}(s, a) \) is the set of possible successor states of \( s \) under action \( a \), and \( T(s, a, s') \) is the probability of moving from \( s \) to \( s' \) under \( a \).

### Experimental Verification

The paper studies the effects of upper confidence bound (UCB) and LCB sampling, as well as action scoping, in multiple experiments. The results show that under limited sampling, LCB sampling and action scoping can synthesize near-optimal policies more quickly, performing especially well in tasks such as multi-armed bandits and RaceTrack.

### Summary

By introducing LCB exploration and action scoping, the paper addresses the problem of quickly synthesizing near-optimal policies under limited sampling access, in particular for gray-box MDPs that model partially unknown environments. In the reported experiments, these methods improve learning efficiency when only few samples are available.
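
### Illustrative Sketch

To make the quality function and the difference between LCB and UCB selection more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a single decision state with two hypothetical actions (`safe`, `risky`), made-up probability intervals, and fixed successor values; the function names `extreme_expectation`, `q_bound`, and `choose_action` are invented for illustration. The lower (upper) bound on \( Q(s, a) \) is obtained by picking, within the intervals, the distribution that puts as much probability as possible on the lowest-valued (highest-valued) successors.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]  # (lower, upper) bound on a transition probability

def extreme_expectation(intervals: Dict[str, Interval],
                        values: Dict[str, float],
                        pessimistic: bool) -> float:
    """Min (pessimistic) or max (optimistic) of sum_s' T(s') * V(s')
    over all distributions T that respect the given intervals.
    Assumes the intervals are consistent: sum(lower) <= 1 <= sum(upper)."""
    # Every successor gets at least its lower bound.
    probs = {s: lo for s, (lo, _) in intervals.items()}
    remaining = 1.0 - sum(probs.values())
    # Hand out the remaining mass to the worst (or best) successors first.
    order = sorted(intervals, key=lambda s: values[s], reverse=not pessimistic)
    for s in order:
        extra = min(intervals[s][1] - probs[s], remaining)
        probs[s] += extra
        remaining -= extra
    return sum(probs[s] * values[s] for s in intervals)

def q_bound(reward: float, intervals: Dict[str, Interval],
            values: Dict[str, float], pessimistic: bool) -> float:
    """Bound on Q(s, a) = R(s) + sum_s' V(s') * T(s, a, s')."""
    return reward + extreme_expectation(intervals, values, pessimistic)

def choose_action(actions: Dict[str, Dict[str, Interval]], reward: float,
                  values: Dict[str, float], pessimistic: bool) -> str:
    """LCB-style choice (pessimistic=True) maximizes the lower bound,
    UCB-style choice (pessimistic=False) maximizes the upper bound."""
    return max(actions,
               key=lambda a: q_bound(reward, actions[a], values, pessimistic))

# Hypothetical example data: tight intervals for "safe", wide ones for "risky".
ACTIONS = {
    "safe":  {"goal": (0.6, 0.8),  "fail": (0.2, 0.4)},
    "risky": {"goal": (0.3, 0.95), "fail": (0.05, 0.7)},
}
VALUES = {"goal": 1.0, "fail": 0.0}  # assumed successor values

if __name__ == "__main__":
    print("LCB choice:", choose_action(ACTIONS, 0.0, VALUES, pessimistic=True))
    print("UCB choice:", choose_action(ACTIONS, 0.0, VALUES, pessimistic=False))
```

In this toy setting the pessimistic choice prefers `safe`, whose outcome is already well estimated, while the optimistic choice prefers `risky`, whose wide intervals leave room for a high value. This mirrors the summary's description of LCB exploration as reinforcing already learned policies instead of exploring uncertain paths.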