Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelik, Vojtěch Forejt, Jan Křetínský, Marta Kwiatkowska, Tobias Meggendorfer, David Parker, Mateusz Ujma
Abstract: We present a general framework for applying learning algorithms and heuristical guidance to the verification of Markov decision processes (MDPs). The primary goal of our techniques is to improve performance by avoiding an exhaustive exploration of the state space, instead focussing on particularly relevant areas of the system, guided by heuristics. Our work builds on the previous results of Brázdil et al., significantly extending it as well as refining several details and fixing errors.
What problem does this paper attempt to address?
### Problems Addressed by the Paper
This paper addresses the verification of Markov decision processes (MDPs), specifically how to improve performance by avoiding exhaustive exploration of the state space. It proposes a general framework that uses learning algorithms and heuristic guidance to achieve this goal.
**Main Contributions Include:**
1. **Probabilistic Reachability Problem**:
- The framework focuses on probabilistic reachability, a core problem in verification, and is instantiated in two scenarios:
- The first scenario assumes complete knowledge of the MDP, including exact transition probabilities. Here the method performs heuristic-driven partial exploration to obtain precise lower and upper bounds on the required probability.
- In the second scenario, the MDP can only be sampled and the exact transition dynamics are unknown. Here the method provides probabilistic guarantees, again as lower and upper bounds, which yield an efficient stopping criterion for the approximation.
2. **Algorithm Framework**:
- A scalable framework is proposed to efficiently solve the reachability problem on "full-information" MDPs and extend it to arbitrary MDPs.
- A model-free PAC learning algorithm suitable for "limited-information" MDPs is introduced and extended to arbitrary MDPs.
3. **Statistical Model Checking**:
- In the limited-information setting, a model-free PAC algorithm based on Delayed Q-Learning is proposed, which provides statistical lower and upper bounds on the maximum reachability probability.
4. **Impact on Related Work**:
- This work has directly influenced many subsequent studies, particularly applications of BRTDP methods and their variants, which have been extended to areas such as long-term average rewards, continuous-time Markov chains, continuous-space MDPs, and stochastic games.
In summary, the main goal of this paper is to improve the efficiency of MDP verification through heuristic methods and learning algorithms, especially for large-scale systems, by avoiding exhaustive exploration of the state space.
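
The heuristic-driven partial exploration of the first scenario can be illustrated with a minimal BRTDP-style sketch. It is not the authors' implementation: the toy MDP, the names `MDP`, `TERMINAL` and `bounded_reachability`, and the plain random successor sampling are all assumptions made for illustration, and the sketch assumes an MDP whose only end components are the absorbing `goal` and `sink` states.

```python
import random

# Hypothetical toy MDP: state -> action -> list of (probability, successor).
# "goal" and "sink" are absorbing; there are no other end components.
MDP = {
    "s0": {"a": [(0.5, "s1"), (0.5, "s2")], "b": [(0.3, "goal"), (0.7, "sink")]},
    "s1": {"x": [(0.9, "goal"), (0.1, "sink")], "y": [(0.5, "goal"), (0.5, "sink")]},
    "s2": {"x": [(0.4, "goal"), (0.3, "sink"), (0.3, "s0")]},
}
# Known reachability values of the absorbing states.
TERMINAL = {"goal": 1.0, "sink": 0.0}


def bounded_reachability(mdp, terminal, init, eps=1e-6, max_episodes=100_000, seed=0):
    """Return (lower, upper) bounds on the maximum probability of reaching the goal."""
    rng = random.Random(seed)
    # Unvisited states implicitly have the trivial bounds lower = 0 and upper = 1.
    lower, upper = dict(terminal), dict(terminal)

    def action_value(bound, default, state, action):
        return sum(p * bound.get(t, default) for p, t in mdp[state][action])

    for _ in range(max_episodes):
        if upper.get(init, 1.0) - lower.get(init, 0.0) <= eps:
            break  # stopping criterion: the bounds at the initial state are eps-close
        # Explore: follow an action with the best upper bound, then sample a successor.
        path, state = [], init
        while state not in terminal:
            path.append(state)
            best = max(mdp[state], key=lambda a: action_value(upper, 1.0, state, a))
            probs, succs = zip(*mdp[state][best])
            state = rng.choices(succs, weights=probs)[0]
        # Update: back up both bounds along the sampled path, last state first.
        for s in reversed(path):
            upper[s] = max(action_value(upper, 1.0, s, a) for a in mdp[s])
            lower[s] = max(action_value(lower, 0.0, s, a) for a in mdp[s])
    return lower.get(init, 0.0), upper.get(init, 1.0)


lo, hi = bounded_reachability(MDP, TERMINAL, "s0")
print(f"maximum reachability probability lies in [{lo:.6f}, {hi:.6f}]")
```

In this toy model the true value at `s0` is 13/17 (about 0.7647), so the returned interval should bracket that number. Only states on sampled paths ever receive non-trivial bounds, which is the sense in which exploration stays partial.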