Abstract: We present a data-driven approach for producing policies that are provably robust across unknown stochastic environments. Existing approaches can learn a model of a single environment as an interval Markov decision process (IMDP) and produce a robust policy with a probably approximately correct (PAC) guarantee on its performance. However, these are unable to reason about the impact of environmental parameters underlying the uncertainty. We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters. We learn and analyse IMDPs for a set of unknown sample environments induced by parameters. The key challenge is then to produce meaningful performance guarantees that combine the two layers of uncertainty: (1) multiple environments induced by parameters with an unknown distribution; (2) unknown induced environments which are approximated by IMDPs. We present a novel approach based on scenario optimisation that yields a single PAC guarantee quantifying the risk level for which a specified performance level can be assured in unseen environments, plus a means to trade off risk and performance. We implement and evaluate our framework using multiple robust policy generation methods on a range of benchmarks. We show that our approach produces tight bounds on a policy's performance with high confidence.
What problem does this paper attempt to address?
The paper addresses the problem of generating provably robust policies for autonomous systems that operate in environments with uncertain parameters. The authors propose a framework based on parametric Markov decision processes (pMDPs) in which the distribution over parameter values is unknown. Existing methods can learn a model of a single environment and generate a robust policy with a probably approximately correct (PAC) performance guarantee, but they cannot reason about how the underlying environmental parameters affect the uncertainty. The proposed method handles multiple unknown environments induced by the parameters, provides a performance guarantee that carries over to unseen environments, and quantifies the trade-off between risk level and performance.
### Main contributions of the paper
1. **Novel framework and techniques**: Propose a new framework for generating provably robust policies in uncertain parametric Markov decision processes, where both the distribution over parameters and the transition probability function are unknown.
2. **New theoretical results**: Provide PAC guarantees on the robust performance of policies in unseen environments, which are sampled from an unknown distribution and can only be estimated from trajectory data.
3. **Implementation and evaluation**: Implement the framework, evaluate it on a range of benchmarks, and show that it yields tight bounds on the performance of the learned policy and on the associated risk.
### Key technologies
- **Parametric Markov decision process (pMDP)**: A Markov decision process defined over a parameter space, in which each parameter value induces a standard MDP.
- **Interval Markov decision process (IMDP)**: Learn the IMDP approximation of an unknown MDP through sample trajectories, thereby obtaining a lower bound on performance.
- **Scenario optimization**: Combine the two layers of uncertainty (the unknown distribution of parameter values and the unknown induced environments), providing a single PAC guarantee that quantifies the performance risk of the policy in unseen environments.
- **Risk tuning**: Allow users to adjust the trade-off between performance guarantees and risk by excluding the worst-case sample environments.
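
The IMDP learning step above can be illustrated with a minimal Python sketch. It is not the paper's estimator: it assumes that transitions are given as `(state, action, next_state)` tuples and uses a two-sided Hoeffding bound for each interval, and the names `learn_intervals` and `delta` are hypothetical. The confidence `delta` applies to each interval separately; no union bound over transitions is applied here.

```python
import math
from collections import Counter


def learn_intervals(transitions, delta=0.05):
    """Estimate probability intervals for observed transitions.

    transitions: iterable of (state, action, next_state) tuples.
    delta: per-interval failure probability (assumed, not from the paper).
    Returns a dict mapping (state, action, next_state) to (lower, upper).
    """
    pair_counts = Counter()    # visits to each (state, action)
    triple_counts = Counter()  # observations of each (state, action, next_state)
    for s, a, s_next in transitions:
        pair_counts[(s, a)] += 1
        triple_counts[(s, a, s_next)] += 1

    intervals = {}
    for (s, a, s_next), k in triple_counts.items():
        n = pair_counts[(s, a)]
        p_hat = k / n
        # Two-sided Hoeffding bound: P(|p_hat - p| >= t) <= 2 * exp(-2 * n * t^2)
        half_width = math.sqrt(math.log(2 / delta) / (2 * n))
        intervals[(s, a, s_next)] = (
            max(0.0, p_hat - half_width),
            min(1.0, p_hat + half_width),
        )
    return intervals


samples = [("s0", "a", "s1")] * 80 + [("s0", "a", "s2")] * 20
for key, (low, high) in learn_intervals(samples).items():
    print(key, round(low, 3), round(high, 3))
```

More samples for a state-action pair shrink the interval width, which in turn tightens the lower bound on performance obtained from the IMDP.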
### Application example
Using UAV path planning as an example, the paper illustrates how to generate robust policies that complete the task safely as environmental parameters such as wind speed and direction vary. With this method, the UAV can choose a shorter path under low-interference conditions and a safer path under high-interference conditions, ensuring a high probability of task completion.
### Theoretical basis
- **Performance evaluation function** \(J(\pi, \theta)\): Used to evaluate the performance of policy \(\pi\) in the environment induced by parameter value \(\theta\).
- **Violation risk** \(r(\pi, \tilde{J})\): The probability, over parameter values drawn from the unknown distribution, that the performance of policy \(\pi\) falls below the specified threshold \(\tilde{J}\).
- **PAC guarantee**: Scenario optimization yields a single PAC guarantee: with a user-specified confidence of at least \(1 - \eta\), the violation risk of the policy in unseen environments is at most a bound \(\epsilon\).
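
The relationship between the performance threshold and the violation risk can be sketched in Python as follows. This is only an illustration under stated assumptions: the performance values are made up, and the names `performance_threshold`, `empirical_violation_rate`, and `num_discarded` are hypothetical. The empirical rate computed here is a plain sample frequency, not the PAC bound \(\epsilon\), which requires the scenario-optimization result from the paper.

```python
def performance_threshold(performances, num_discarded=0):
    """Worst performance that remains after dropping the lowest
    `num_discarded` sample environments (a risk-tuning knob)."""
    if not 0 <= num_discarded < len(performances):
        raise ValueError("num_discarded must be in [0, len(performances))")
    return sorted(performances)[num_discarded]


def empirical_violation_rate(performances, threshold):
    """Fraction of sample environments whose performance is below threshold."""
    return sum(j < threshold for j in performances) / len(performances)


# Made-up lower bounds on J(pi, theta_i) for five sampled environments.
performances = [0.9, 0.7, 0.95, 0.4, 0.85]

strict = performance_threshold(performances)                    # keeps all samples
relaxed = performance_threshold(performances, num_discarded=1)  # drops the worst one

print(strict, empirical_violation_rate(performances, strict))
print(relaxed, empirical_violation_rate(performances, relaxed))
```

Discarding the worst sample raises the threshold that can be claimed, at the cost of a higher violation rate; this is the trade-off that risk tuning exposes.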
### Mathematical formulas
- **Performance evaluation function**: \(J(\pi, \theta)\) denotes the performance of policy \(\pi\) in the environment induced by parameter value \(\theta\).
- **Violation risk**:
\[
r(\pi, \tilde{J}) = P\{\theta\in\Theta:J(\pi, \theta)<\tilde{J}\}
\]
- **PAC guarantee**:
\[
P\{r(\pi, \tilde{J}_\gamma)\leq\epsilon(N, \gamma, \eta)\}\geq1 - \eta
\]
where \(\tilde{J}_\gamma = \min_i J(\pi, M_\gamma[\theta_i])\), and \(\epsilon(N, \gamma, \eta)\) is the solution of the following equation: