Operator Splitting for Convex Constrained Markov Decision Processes

Panagiotis D. Grontas, Anastasios Tsiamis, John Lygeros
2024-12-19
Abstract: We consider finite Markov decision processes (MDPs) with convex constraints and known dynamics. In principle, this problem is amenable to off-the-shelf convex optimization solvers, but typically this approach suffers from poor scalability. In this work, we develop a first-order algorithm, based on the Douglas-Rachford splitting, that allows us to decompose the dynamics and constraints. Thanks to this decoupling, we can incorporate a wide variety of convex constraints. Our scheme consists of simple and easy-to-implement updates that alternate between solving a regularized MDP and a projection. The inherent presence of regularized updates ensures last-iterate convergence, numerical stability, and, contrary to existing approaches, does not require us to regularize the problem explicitly. If the constraints are not attainable, we exploit salient properties of the Douglas-Rachford algorithm to detect infeasibility and compute a policy that minimally violates the constraints. We demonstrate the performance of our algorithm on two benchmark problems and show that it compares favorably to competing approaches.
Optimization and Control, Systems and Control
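The alternation described in the abstract, a regularized subproblem followed by a projection, is the generic Douglas-Rachford pattern. The following is a minimal pure-Python sketch of that pattern on a toy problem, not the paper's algorithm. As labeled assumptions: the probability simplex stands in for the set defined by the MDP dynamics, a single halfspace stands in for the convex constraint set C, and the function names, the step size `sigma`, and the iteration count are all hypothetical choices made for illustration.

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:  # holds for j = 1..rho; the last hit gives the shift
            theta = t
    return [max(x - theta, 0.0) for x in v]


def project_halfspace(v, a, b):
    """Euclidean projection of v onto the halfspace {x : a.x <= b}."""
    excess = sum(ai * vi for ai, vi in zip(a, v)) - b
    if excess <= 0:
        return list(v)  # already inside the halfspace
    scale = excess / sum(ai * ai for ai in a)
    return [vi - scale * ai for ai, vi in zip(a, v)]


def douglas_rachford(c, a, b, sigma=1.0, iters=200):
    """Toy Douglas-Rachford loop: minimize c.d over the simplex s.t. a.d <= b."""
    w = [1.0 / len(c)] * len(c)
    d = list(w)
    for _ in range(iters):
        # Regularized step (toy stand-in for the regularized MDP):
        # argmin over the simplex of c.d + ||d - w||^2 / (2 * sigma)
        d = project_simplex([wi - sigma * ci for wi, ci in zip(w, c)])
        # Projection of the reflected point onto the constraint set C
        z = project_halfspace([2 * di - wi for di, wi in zip(d, w)], a, b)
        # Update of the auxiliary variable
        w = [wi + zi - di for wi, zi, di in zip(w, z, d)]
    return d


if __name__ == "__main__":
    # Costs favor the first coordinate, but the constraint caps it at 0.5.
    print(douglas_rachford([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], 0.5))
```

On this toy instance the iterates approach `[0.5, 0.5, 0.0]`: the cheapest coordinate is filled up to the cap and the remaining mass goes to the next cheapest. The regularized step here has a closed form only because the stand-in set is a simplex; in the paper's setting that step is a regularized MDP instead.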
What problem does this paper attempt to address?
This paper addresses the optimization of finite Markov decision processes (MDPs) with convex constraints and known dynamics. In principle, such problems can be solved with off-the-shelf convex optimization solvers, but these methods typically scale poorly, especially when the state-action space is large.

### Main contributions of the paper

1. **A first-order algorithm based on Douglas-Rachford splitting**:
   - Decoupling the dynamics from the constraints makes it possible to handle a wide variety of convex constraints.
   - The algorithm consists of simple, easy-to-implement updates that alternate between solving a regularized MDP and a projection.
   - The inherent regularization of the updates ensures last-iterate convergence and numerical stability, without any need to regularize the problem explicitly.
2. **Handling of infeasible problems**:
   - Properties of the Douglas-Rachford algorithm are exploited to detect infeasibility and to compute a policy that minimally violates the constraints.
   - When the constraints are not attainable, the algorithm finds the solution closest to the feasible region.
3. **Performance verification**:
   - The algorithm is demonstrated on two benchmark problems and shown to compare favorably to competing approaches.

### Technical details

#### A. Decomposition of dynamics and constraints

The paper designs an efficient, modular algorithm by separating the dynamics of the MDP from its constraints via Douglas-Rachford splitting. The key is how to solve sub-problems (7a) and (7c) efficiently:

- **Sub-problem (7a)**: a quadratically regularized MDP.
- **Sub-problem (7c)**: a projection onto the closed convex set C.

#### B. Solving the regularized MDP

For sub-problem (7a), the paper proposes an iterative scheme, called quadratic regularized policy iteration (QRPI), with the following updates:

\[ V^{in}_{\ell+1} \leftarrow \left((\gamma P-\Xi)^{\top}(\gamma P-\Xi)\right)^{-1}\left((\gamma P-\Xi)^{\top}\left(\frac{1}{\sigma}w_{k}-c+\phi^{in}_{\ell}\right)+\frac{1}{\sigma}(1-\gamma)\rho\right) \]

\[ \phi^{in}_{\ell+1} \leftarrow \max\left(c+\gamma PV^{in}_{\ell+1}-\Xi V^{in}_{\ell+1}-\frac{1}{\sigma}w_{k},\,0\right) \]

\[ d^{in}_{\ell+1} \leftarrow \sigma\max\left(-c-\gamma PV^{in}_{\ell+1}+\Xi V^{in}_{\ell+1}+\frac{1}{\sigma}w_{k},\,0\right) \]

#### C. Constraint projection

Sub-problem (7c) can be written in the more familiar form of a projection:

\[ \text{prox}_{\sigma g}(2d_{k}-w_{k})=\arg\min_{d'\in C}\|d'-(2d_{k}-w_{k})\|^{2}=P_{C}(2d_{k}-w_{k}) \]

#### D. Overall algorithm

Combining the above update rules, the paper proposes the OS-CMDP algorithm, which consists of an outer loop and an inner loop:

- The outer loop executes the Douglas-Rachford algorithm (DRA).
- The inner loop approximately solves the regularized MDP through QRPI.

### Convergence analysis

The paper also analyzes the convergence of the algorithm in detail, covering the convergence of QRPI and the asymptotic behavior of the overall algorithm in both the feasible and the infeasible case.

### Termination conditions

To specify meaningful termination conditions, the paper derives the optimality conditions and suggests the following termination criteria:

\[ \|d_{k}-