Abstract: The Markov Decision Process (MDP) is a widely used mathematical model for sequential decision-making problems. In this paper, we present a new geometric interpretation of MDPs with a natural normalization procedure that allows us to adjust the value function at each state without altering the advantage of any action with respect to any policy. This procedure enables the development of a novel class of algorithms for solving MDPs that find optimal policies without explicitly computing policy values. The new algorithms we propose for different settings achieve and, in some cases, improve upon state-of-the-art sample complexity results.
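The invariance claimed in the abstract can be checked numerically on a toy problem. The sketch below is not the paper's procedure: it assumes one familiar way to realize such an adjustment, a potential-style reward shift `r'(s, a) = r(s, a) - delta[s] + gamma * sum_t P(t | s, a) * delta[t]`, under which every policy's value at state `s` drops by `delta[s]` while all advantages stay the same. The two-state MDP, the function names, and the shift vector `delta` are all hypothetical.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative only.
GAMMA = 0.9
# P[s][a] is the next-state distribution, R[s][a] the expected reward.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.5, 2.0]]


def policy_values(P, R, policy, gamma, sweeps=2000):
    """Evaluate a deterministic policy by repeated Bellman backups."""
    n = len(P)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [R[s][policy[s]]
             + gamma * sum(P[s][policy[s]][t] * v[t] for t in range(n))
             for s in range(n)]
    return v


def advantages(P, R, policy, gamma):
    """Return A[s][a] = Q(s, a) - V(s) for the given policy."""
    n = len(P)
    v = policy_values(P, R, policy, gamma)
    return [[R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n)) - v[s]
             for a in range(len(R[s]))]
            for s in range(n)]


def shift_rewards(P, R, delta, gamma):
    """Assumed shift: lowers every policy's value at state s by delta[s]."""
    n = len(P)
    return [[R[s][a] - delta[s]
             + gamma * sum(P[s][a][t] * delta[t] for t in range(n))
             for a in range(len(R[s]))]
            for s in range(n)]


policy = [0, 1]        # arbitrary deterministic policy
delta = [3.0, -1.5]    # arbitrary per-state value adjustment
R_shifted = shift_rewards(P, R, delta, GAMMA)

before = advantages(P, R, policy, GAMMA)
after = advantages(P, R_shifted, policy, GAMMA)
max_change = max(abs(before[s][a] - after[s][a])
                 for s in range(2) for a in range(2))
print("largest change in any advantage:", max_change)
```

The printed difference should be at the level of floating-point noise for any choice of `policy` and `delta`, which is the property the abstract describes: state values move, advantages do not.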

### What problem does this paper attempt to solve?

This paper addresses policy evaluation and optimal-policy search in Markov Decision Processes (MDPs) from a new geometric perspective. Specifically, it proposes a geometric interpretation that allows the value function at each state to be adjusted without changing the advantage of any action with respect to any policy. Building on this interpretation, it develops a new class of algorithms that find the optimal policy without explicitly computing policy values and that, in some cases, improve on the sample complexity of existing algorithms.

#### Main contributions

1. **New geometric interpretation**: The paper shows that the classical MDP framework can be given a geometric representation and introduces a geometric transformation that converts MDP action rewards into a so-called "standard form" while keeping the optimal policy unchanged. For MDPs in standard form, finding the optimal policy becomes very simple.
2. **Value-free MDP-solving algorithms**: Based on this geometric intuition, the paper proposes a new algorithm named "Safe Reward Balancing" (RB-S). The algorithm modifies the MDP step by step so that it approaches its standard form. The worst-case iteration complexity of RB-S is the same as that of Value Iteration (VI), but RB-S performs better in some important cases. For example, in MDPs with a hierarchical structure, RB-S converges in finite time; when the diagonal elements of the MDP transition matrix are positive, RB-S converges faster than VI.
3. **Extension to unknown MDPs**: The paper also extends RB-S to the case of unknown MDPs (i.e., the standard Q-learning setting). The stochastic version of RB-S does not require a learning rate and, like Q-learning, only needs to maintain a vector, but its sample complexity is better than that of Q-learning and it can be accelerated by parallelization.

#### Specific problem description

1. **Limitations of traditional MDP-solving methods**: Traditional MDP-solving methods (such as value iteration and policy iteration) rely on state-value estimation, which is computationally expensive, especially in large-scale problems.
2. **Motivation for the new method**: By introducing a geometric perspective, the paper aims to simplify the MDP-solving process and avoid computing state values directly, thereby improving solving efficiency and convergence speed.
3. **Practical application background**: MDPs are widely used in fields such as reinforcement learning and automatic control, so improving MDP-solving methods has both theoretical and practical significance.

In summary, the main goal of this paper is to provide a more efficient way of solving MDPs through a new geometric interpretation and new algorithms, with particular emphasis on improving performance on complex and large-scale problems.
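For reference, the Value Iteration baseline that RB-S is compared against can be sketched as follows. This is the textbook algorithm, not RB-S itself, whose update rule is not spelled out in this summary. The function name, the nested-list MDP layout, the stopping tolerance, and the example numbers are assumptions made for illustration.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iters=100_000):
    """Textbook value iteration on a finite MDP.

    P[s][a][t] is the probability of moving from s to t under action a,
    R[s][a] is the expected reward, and gamma < 1 is the discount factor.
    Returns (values, greedy_policy, iterations_used).
    """
    n = len(P)
    v = [0.0] * n
    for iteration in range(1, max_iters + 1):
        # One Bellman optimality backup for every state-action pair.
        q = [[R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n))
              for a in range(len(R[s]))]
             for s in range(n)]
        new_v = [max(row) for row in q]
        gap = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if gap < tol:
            break
    # Greedy policy with respect to the last computed action values.
    policy = [max(range(len(q[s])), key=lambda a: q[s][a]) for s in range(n)]
    return v, policy, iteration


# Hypothetical 2-state, 2-action MDP used only to exercise the function.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.5, 2.0]]
values, policy, iterations = value_iteration(P, R, gamma=0.9)
print(values, policy, iterations)
```

Note that every sweep recomputes and stores state values, which is the step the value-free approach described above is meant to avoid.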

MDP Geometry, Normalization and Reward Balancing Solvers

Geometric Active Exploration in Markov Decision Processes: the Benefit of Abstraction

Solving Multi-Model MDPs by Coordinate Ascent and Dynamic Programming

Fast Online Exact Solutions for Deterministic MDPs with Sparse Rewards

Markov Decision Processes with Time-Varying Geometric Discounting

Safe Exploration and Optimization of Constrained MDPs Using Gaussian Processes

J-P: MDP. FP. PP.: Characterizing Total Expected Rewards in Markov Decision Processes as Least Fixed Points with an Application to Operational Semantics of Probabilistic Programs (Technical Report)

Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action

Solving Robust MDPs through No-Regret Dynamics

Optimizing Local Satisfaction of Long-Run Average Objectives in Markov Decision Processes

Bi-Objective Lexicographic Optimization in Markov Decision Processes with Related Objectives

Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

Making Linear MDPs Practical via Contrastive Representation Learning

Solving Markov Decision Processes with Reachability Characterization from Mean First Passage Times

Policy Graph Pruning and Optimization in Monte Carlo Value Iteration for Continuous-State POMDPs

Centralized Optimization for Dec-POMDPs under the Expected Average Reward Criterion

A Lazy Abstraction Algorithm for Markov Decision Processes: Theory and Initial Evaluation

Robust Markov Decision Processes: A Place Where AI and Formal Methods Meet

Weighted mesh algorithms for general Markov decision processes: Convergence and tractability

Optimizing Norm-Bounded Weighted Ambiguity Sets for Robust MDPs

Long-Term Resource Allocation Fairness in Average Markov Decision Process (AMDP) Environment