Abstract: Lying at the heart of intelligent decision-making systems, how a policy is represented and optimized is a fundamental problem. The root challenge is the large scale and high complexity of the policy space, which exacerbates the difficulty of policy learning, especially in real-world scenarios. Towards a desirable surrogate policy space, policy representation in a low-dimensional latent space has recently shown its potential for improving both the evaluation and the optimization of policies. The key question in these studies is by what criterion we should abstract the policy space to obtain the desired compression and generalization. However, both the theory of policy abstraction and the methodology of policy representation learning are less studied in the literature. In this work, we make a first effort to fill this gap. First, we propose a unified policy abstraction theory, containing three types of policy abstraction associated with policy features at different levels. Then, we generalize them to three policy metrics that quantify the distance (i.e., similarity) of policies, for more convenient use in learning policy representations. Further, we propose a policy representation learning approach based on deep metric learning. For the empirical study, we investigate the efficacy of the proposed policy metrics and representations in characterizing policy difference and conveying policy generalization, respectively. Our experiments are conducted on both policy optimization and evaluation problems, covering trust-region policy optimization (TRPO), diversity-guided evolution strategy (DGES), and off-policy evaluation (OPE). Somewhat naturally, the experimental results indicate that there is no universally optimal abstraction for all downstream learning problems, while the influence-irrelevance policy abstraction can be a generally preferred choice.

What problem does this paper attempt to address?

The core problem this paper addresses is how to represent and optimize policies effectively in Markov Decision Processes (MDPs). Specifically, it focuses on policies in intelligent decision-making systems, where the large scale and high complexity of the policy space make learning particularly difficult, especially in real-world scenarios. To this end, the paper proposes a unified policy abstraction theory and a policy representation learning method based on deep metric learning, with the aim of improving policy learning, evaluation, and optimization.

### Main contributions of the paper

1. **Unified policy abstraction theory**:
   - Three types of policy abstraction are proposed: distribution-irrelevance abstraction, influence-irrelevance abstraction, and value-irrelevance abstraction. Each aggregates policies according to a different criterion.
   - These abstractions are generalized into three policy metrics that quantify the distance (i.e., similarity) between policies, which makes them more convenient to use in policy representation learning.
2. **Policy representation learning method based on deep metric learning**:
   - An alignment loss is proposed: policy representations are learned by minimizing the difference between the distances of policy embeddings and the corresponding policy metrics.
   - The Maximum Mean Discrepancy (MMD) is used to estimate the policy metrics efficiently, and the Layer-wise Permutation-invariant Encoder (LPE) is adopted for structure-aware encoding.
3. **Experimental verification**:
   - Experiments are carried out on policy optimization and evaluation problems, including Trust Region Policy Optimization (TRPO), Diversity-Guided Evolution Strategy (DGES), and Off-Policy Evaluation (OPE).
   - The results show that the different policy abstractions and metrics perform differently across downstream tasks, and that no abstraction is universally optimal; the influence-irrelevance abstraction, however, can be a generally preferred choice.

### Definition of policy abstraction

According to the paper, a policy abstraction is a mapping from the original policy space to an abstract space. The three types are defined as follows:

- **Distribution-irrelevance abstraction (\(f_{\pi}\))**: If two policies are mapped to the same abstract representation, then they have the same action distribution in every state.

  \[ f_{\pi}(\pi_i) = f_{\pi}(\pi_j) \implies \pi_i(a|s) = \pi_j(a|s), \forall s \in S, a \in A \]

- **Influence-irrelevance abstraction (\(f_{P\pi}\))**: If two policies are mapped to the same abstract representation, then they induce the same state-transition distribution in every state.

  \[ f_{P\pi}(\pi_i) = f_{P\pi}(\pi_j) \implies P_{\pi_i}(s'|s) = P_{\pi_j}(s'|s), \forall s, s' \in S \]

- **Value-irrelevance abstraction (\(f_{V\pi}\))**: If two policies are mapped to the same abstract representation, then they have the same value function in every state.

  \[ f_{V\pi}(\pi_i) = f_{V\pi}(\pi_j) \implies V_{\pi_i}(s) = V_{\pi_j}(s), \forall s \in S \]

### Definition of policy metrics

To quantify the similarity between policies, the paper defines three policy metrics:

- **Distribution-irrelevance metric (\(d_{\pi}\))**:

  \[ d_{\pi}(\pi_i, \pi_j) = \mathbb{E}_{s \sim p(s)}[D
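
To make the second contribution listed above more concrete, the following is a minimal sketch of how an MMD estimate and an alignment loss could fit together. It is not the paper's implementation. The Gaussian kernel, the bandwidth of 1.0, the biased (V-statistic) estimator, the Euclidean embedding distance, the squared-error form of the loss, and the toy policies `pi_a`, `pi_b`, `pi_c` are all assumptions chosen for illustration.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """RBF kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def alignment_loss(embeddings, target_distances):
    """Mean squared gap between embedding distances and target policy distances.

    embeddings: one vector per policy.
    target_distances[i][j]: estimated policy metric between policies i and j.
    """
    n = len(embeddings)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            emb_dist = math.dist(embeddings[i], embeddings[j])
            total += (emb_dist - target_distances[i][j]) ** 2
            pairs += 1
    return total / pairs

# Hypothetical 1-D action samples drawn from three toy policies.
samples = {
    "pi_a": [[0.0], [0.1], [-0.1]],
    "pi_b": [[0.0], [0.1], [-0.1]],   # behaves exactly like pi_a
    "pi_c": [[2.0], [2.1], [1.9]],    # behaves very differently
}
names = list(samples)

# Pairwise policy distances estimated as the square root of the MMD estimate;
# max(..., 0.0) guards against tiny negative values from rounding.
dist = [
    [math.sqrt(max(mmd_squared(samples[p], samples[q]), 0.0)) for q in names]
    for p in names
]

# Embeddings whose pairwise distances match the estimates give near-zero loss;
# embeddings that collapse all policies to one point do not.
aligned = [[0.0], [0.0], [dist[0][2]]]
collapsed = [[0.0], [0.0], [0.0]]
loss_aligned = alignment_loss(aligned, dist)
loss_collapsed = alignment_loss(collapsed, dist)
```

In this sketch the embeddings are fixed by hand, only to show what the loss rewards. In a learned setting they would be produced by an encoder and the loss would be minimized by gradient descent over the encoder's parameters.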

Towards A Unified Policy Abstraction Theory and Representation Learning Approach in Markov Decision Processes
