Abstract: We consider finite Markov decision processes (MDPs) with convex constraints and known dynamics. In principle, this problem is amenable to off-the-shelf convex optimization solvers, but typically this approach suffers from poor scalability. In this work, we develop a first-order algorithm, based on the Douglas-Rachford splitting, that allows us to decompose the dynamics and constraints. Thanks to this decoupling, we can incorporate a wide variety of convex constraints. Our scheme consists of simple and easy-to-implement updates that alternate between solving a regularized MDP and a projection. The inherent presence of regularized updates ensures last-iterate convergence, numerical stability, and, contrary to existing approaches, does not require us to regularize the problem explicitly. If the constraints are not attainable, we exploit salient properties of the Douglas-Rachford algorithm to detect infeasibility and compute a policy that minimally violates the constraints. We demonstrate the performance of our algorithm on two benchmark problems and show that it compares favorably to competing approaches.

What problem does this paper attempt to address?

This paper addresses the optimization of finite Markov decision processes (MDPs) with convex constraints. Specifically, it focuses on how to solve such constrained MDPs efficiently when the dynamics are known. In principle, these problems can be handed to off-the-shelf convex optimization solvers, but that approach usually scales poorly, especially when the state-action space is large.

### Main contributions of the paper

1. **A first-order algorithm based on Douglas-Rachford splitting**:
   - Decoupling the dynamics from the constraints makes it possible to handle a wide variety of convex constraints.
   - The algorithm consists of simple, easy-to-implement updates that alternate between solving a regularized MDP and a projection.
   - The inherent regularized updates ensure last-iterate convergence and numerical stability, with no need to regularize the problem explicitly.
2. **Ability to handle infeasible problems**:
   - The algorithm exploits properties of the Douglas-Rachford algorithm to detect infeasibility and to compute a policy that minimally violates the constraints.
   - When the constraints are not attainable, the algorithm finds the solution closest to the feasible region.
3. **Performance verification**:
   - The paper demonstrates the algorithm on two benchmark problems and shows that it compares favorably to competing approaches.

### Technical details

#### A. Decomposition of dynamics and constraints

The paper designs an efficient, modular algorithm by using Douglas-Rachford splitting to separate the dynamics of the MDP from its constraints. The key is how to solve sub-problems (7a) and (7c) efficiently:

- **Sub-problem (7a)**: solved as a quadratically regularized MDP.
- **Sub-problem (7c)**: a projection onto the closed convex set C.

#### B. Solving the regularized MDP

For sub-problem (7a), the paper proposes an iterative scheme called quadratic regularized policy iteration (QRPI), with the following steps:

\[ V^{in}_{\ell+1}\leftarrow((\gamma P-\Xi)^{\top}(\gamma P-\Xi))^{-1}\left((\gamma P-\Xi)^{\top}\left(\frac{1}{\sigma}w_{k}-c+\phi^{in}_{\ell}\right)+\frac{1}{\sigma}(1-\gamma)\rho\right) \]

\[ \phi^{in}_{\ell+1}\leftarrow\max\left(c+\gamma PV^{in}_{\ell+1}-\Xi V^{in}_{\ell+1}-\frac{1}{\sigma}w_{k},0\right) \]

\[ d^{in}_{\ell+1}\leftarrow\sigma\max\left(-c-\gamma PV^{in}_{\ell+1}+\Xi V^{in}_{\ell+1}+\frac{1}{\sigma}w_{k},0\right) \]

#### C. Constraint projection

For sub-problem (7c), the paper rewrites the proximal step in the more familiar projection form:

\[ \text{prox}_{\sigma g}(2d_{k}-w_{k})=\arg\min_{d'\in C}\|d'-(2d_{k}-w_{k})\|^{2}=P_{C}(2d_{k}-w_{k}) \]

#### D. Overall algorithm

Combining the above update rules, the paper proposes the OS-CMDP algorithm, which consists of an outer loop and an inner loop:

- The outer loop executes the Douglas-Rachford algorithm (DRA).
- The inner loop approximately solves the regularized MDP through QRPI.

### Convergence analysis

The paper also analyzes the convergence of the algorithm in detail, including the convergence of QRPI and the asymptotic behavior of the overall algorithm in both the feasible and the infeasible case.

### Termination conditions

To specify meaningful termination conditions, the paper derives the optimality conditions and suggests the following termination criteria:

\[ \|d_{k}-
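
The alternation between a regularized MDP step and a projection can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's OS-CMDP implementation: it assumes a single-state MDP with three actions, so the set of occupancy measures is the probability simplex and the regularized step reduces to a simplex projection (standing in for QRPI). The cost vector `c`, the constraint set `C = {d : d[0] <= 0.5}`, the helper names, and the driver update `w <- w + z - d` (the standard Douglas-Rachford form, not quoted from the summary above) are all hypothetical choices made for illustration.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - 1.0                  # shifted cumulative sums
    idx = np.arange(1, len(v) + 1)
    rho = idx[u - css / idx > 0][-1]          # size of the support
    theta = css[rho - 1] / rho                # shift that makes the result sum to 1
    return np.maximum(v - theta, 0.0)


def project_cap(v, cap=0.5):
    """Projection onto the hypothetical constraint set C = {d : d[0] <= cap}."""
    z = v.copy()
    z[0] = min(z[0], cap)
    return z


def douglas_rachford(c, sigma=1.0, iters=200):
    """Toy Douglas-Rachford loop: regularized step, projection, driver update."""
    w = np.zeros_like(c)
    for _ in range(iters):
        # Regularized "MDP" step: argmin over the simplex of
        # sigma * c @ d + 0.5 * ||d - w||^2 (assumed stand-in for sub-problem (7a)).
        d = project_simplex(w - sigma * c)
        # Projection of the reflected point onto C (form of sub-problem (7c)).
        z = project_cap(2.0 * d - w)
        # Standard Douglas-Rachford driver update (assumed form).
        w = w + z - d
    return d, np.linalg.norm(z - d)


c = np.array([1.0, 2.0, 3.0])                 # hypothetical per-action costs
d, gap = douglas_rachford(c)
print("occupancy measure:", d)
print("residual ||z - d||:", gap)
```

For this toy instance the constrained optimum can be checked by hand: the cheapest action is capped at probability 0.5, so the remaining mass goes to the second-cheapest action and the minimizer is `[0.5, 0.5, 0]`. The residual `||z - d||` is one natural quantity to monitor; when the constraint set does not intersect the set of occupancy measures, this residual would not be expected to vanish, which is the kind of behavior the infeasibility detection described above relies on.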

Operator Splitting for Convex Constrained Markov Decision Processes
