Abstract: A central task in control theory, artificial intelligence, and formal methods is to synthesize reward-maximizing strategies for agents that operate in partially unknown environments. In environments modeled by gray-box Markov decision processes (MDPs), the impact of the agents' actions is known in terms of successor states but not the stochastics involved. In this paper, we devise a strategy synthesis algorithm for gray-box MDPs via reinforcement learning that utilizes interval MDPs as an internal model. To compete with limited sampling access in reinforcement learning, we incorporate two novel concepts into our algorithm, focusing on rapid and successful learning rather than on stochastic guarantees and optimality: lower confidence bound exploration reinforces variants of already learned practical strategies, and action scoping reduces the learning action space to promising actions. We illustrate the benefits of our algorithms by means of a prototypical implementation applied to examples from the AI and formal methods communities.

What problem does this paper attempt to address?

The core problem this paper addresses is: how can an agent in a partially unknown environment quickly synthesize a near-optimal, reward-maximizing policy with only limited sampling access? Specifically, the paper studies how an agent can use reinforcement learning (RL) to find a policy with high expected cumulative reward in an environment modeled as a gray-box Markov decision process (MDP), where the possible successor states of each action are known but their probabilities are not.

### Main Problems and Solutions

1. **Environment Model and Challenges**
   - The environment is modeled as a gray-box MDP: the agent knows the impact of its actions (i.e., the successor states) but not the probabilities involved.
   - Traditional RL methods prioritize convergence to a globally optimal policy with stochastic guarantees, which makes learning slow and impractical when only a few samples are available.
2. **Proposed Solutions**
   - Two new concepts are introduced to enable rapid learning under limited sampling access:
     - **Lower Confidence Bound (LCB) Exploration**: Reinforces variants of practical policies that have already been learned, reducing the exploration of uncertain paths.
     - **Action Scoping**: Narrows the learning action space to promising actions, thereby speeding up policy synthesis.
3. **Algorithm Design**
   - An RL algorithm that uses an interval MDP (IMDP) as its internal model is proposed.
   - By iteratively updating the intervals in the IMDP and combining LCB exploration with action scoping, the algorithm can quickly find a "good" policy under limited sampling.

### Formula Representation

- **Definition of IMDP**
  \[ U=(S, A, \imath, G, R, \hat{T}) \]
  where \( S \) is the set of states, \( A \) is the set of actions, \( \imath \) is the initial state, \( G\subseteq S \) is the set of target states, \( R: S\rightarrow\mathbb{R} \) is the reward function, and \( \hat{T}: S\times A\rightarrow\text{Intv}(S) \) is the interval transition function.
- **Lower and Upper Value Functions**
  \[ V_U(s)=\min_{M\in[U]} V_M(s),\quad V^U(s)=\max_{M\in[U]} V_M(s) \]
  where \( V_M(s) \) is the value function of MDP \( M \) and \( [U] \) is the set of MDPs whose transition probabilities lie within the intervals of \( U \).
- **Quality Function**
  \[ Q(s, a)=R(s)+\sum_{s'\in\text{Post}(s, a)} V(s')\cdot T(s, a, s') \]
  where \( \text{Post}(s, a) \) is the set of possible successors of \( s \) under action \( a \) and \( T(s, a, s') \) is the probability of moving from \( s \) to \( s' \).

### Experimental Verification

The paper studies the effects of UCB and LCB sampling and of action scoping in several experiments. The results indicate that, under limited sampling, LCB exploration and action scoping synthesize near-optimal policies more quickly, with particularly good results on tasks such as multi-armed bandits and RaceTrack.

### Summary

By introducing LCB exploration and action scoping, the paper addresses the problem of quickly synthesizing near-optimal policies under limited sampling access in gray-box MDPs. In the reported experiments, these methods improve learning efficiency compared with UCB-based sampling.
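
The interval-based quantities described in this summary can be illustrated with a small Python sketch. It is not the paper's implementation: the Hoeffding-style interval estimate, the confidence parameter `delta`, all function names, and the concrete scoping rule (drop an action whose upper quality bound falls below the best lower bound) are assumptions chosen for illustration. The sketch bounds the quality function \( Q(s, a) \) over all transition distributions that fit the intervals, then uses the bounds for LCB action choice and action scoping.

```python
import math

def interval_estimate(counts, delta=0.1):
    """Assumed Hoeffding-style intervals for the successor probabilities of one
    state-action pair; `counts` maps each known successor to its sample count."""
    n = sum(counts.values())
    if n == 0:
        return {s: (0.0, 1.0) for s in counts}
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return {s: (max(0.0, c / n - eps), min(1.0, c / n + eps))
            for s, c in counts.items()}

def bound_q(reward, intervals, values, pessimistic=True):
    """Lower (pessimistic) or upper bound of R(s) + sum_s' V(s') * T(s, a, s'),
    taken over all distributions T that respect the given intervals."""
    # Every successor gets at least its lower bound.
    probs = {s: lo for s, (lo, _) in intervals.items()}
    budget = max(0.0, 1.0 - sum(probs.values()))
    # Hand out the remaining mass to the worst successors first (pessimistic)
    # or to the best successors first (optimistic).
    order = sorted(intervals, key=lambda s: values[s], reverse=not pessimistic)
    for s in order:
        lo, hi = intervals[s]
        extra = min(hi - lo, budget)
        probs[s] += extra
        budget -= extra
    return reward + sum(values[s] * p for s, p in probs.items())

def lcb_action(lower_q):
    """LCB choice: pick the action with the highest lower quality bound."""
    return max(lower_q, key=lower_q.get)

def scope_actions(lower_q, upper_q):
    """Assumed scoping rule: keep an action only if its upper bound reaches
    the best lower bound among all actions."""
    best_lower = max(lower_q.values())
    return [a for a in lower_q if upper_q[a] >= best_lower]

# Hypothetical state with two actions and two known successors.
values = {"goal": 1.0, "fail": 0.0}
samples = {"a": {"goal": 80, "fail": 20}, "b": {"goal": 20, "fail": 80}}
intervals = {a: interval_estimate(c) for a, c in samples.items()}
lower_q = {a: bound_q(0.0, iv, values, pessimistic=True) for a, iv in intervals.items()}
upper_q = {a: bound_q(0.0, iv, values, pessimistic=False) for a, iv in intervals.items()}
print(lcb_action(lower_q), scope_actions(lower_q, upper_q))
```

With few samples the intervals are wide and both actions stay in scope; as the counts grow, the intervals tighten and clearly inferior actions drop out, which is the effect action scoping is meant to exploit.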

Strategy Synthesis in Markov Decision Processes Under Limited Sampling Access

Strategy Synthesis in POMDPs via Game-Based Abstractions

Strategy synthesis for partially-known switched stochastic systems

Sampling-based Reactive Synthesis for Nondeterministic Hybrid Systems

Safety-Constrained Reinforcement Learning for MDPs

A learning-based synthesis approach of reward asynchronous probabilistic games against the linear temporal logic winning condition

Learning Optimal Strategies for Temporal Tasks in Stochastic Games

Efficient Strategy Synthesis for Switched Stochastic Systems with Distributional Uncertainty

Estimation and Control Using Sampling-Based Bayesian Reinforcement Learning

Search and Explore: Symbiotic Policy Synthesis in POMDPs

Counterexample-Guided Strategy Improvement for POMDPs Using Recurrent Neural Networks

On the markovian randomized strategy of controller for markov decision processes

Permissive Controller Synthesis for Probabilistic Systems

Data-Driven Strategy Synthesis for Stochastic Systems with Unknown Nonlinear Disturbances

Synthesis from LTL with Reward Optimization in Sampled Oblivious Environments

Model-Free $μ$ Synthesis via Adversarial Reinforcement Learning

Synthesis for multi-objective stochastic games: an application to autonomous urban driving

Model-Free Reinforcement Learning for Stochastic Games with Linear Temporal Logic Objectives

SOS: Safe, Optimal and Small Strategies for Hybrid Markov Decision Processes

Unpredictable Planning Under Partial Observability

Peer Review #3 of "A Learning-Based Synthesis Approach of Reward Asynchronous Probabilistic Games Against the Linear Temporal Logic Winning Condition (V0.1)"