Towards A Unified Policy Abstraction Theory and Representation Learning Approach in Markov Decision Processes

Min Zhang, Hongyao Tang, Jianye Hao, Yan Zheng
DOI: https://doi.org/10.48550/arXiv.2209.07696
2022-09-16
Abstract: Lying at the heart of intelligent decision-making systems, how policy is represented and optimized is a fundamental problem. The root challenge in this problem is the large scale and high complexity of the policy space, which exacerbates the difficulty of policy learning, especially in real-world scenarios. Towards a desirable surrogate policy space, policy representation in a low-dimensional latent space has recently shown its potential in improving both the evaluation and optimization of policy. The key question involved in these studies is by what criterion we should abstract the policy space for the desired compression and generalization. However, both the theory of policy abstraction and the methodology of policy representation learning are less studied in the literature. In this work, we make the very first efforts to fill this gap. First, we propose a unified policy abstraction theory, containing three types of policy abstraction associated with policy features at different levels. Then, we generalize them to three policy metrics that quantify the distance (i.e., similarity) of policies, for more convenient use in learning policy representation. Further, we propose a policy representation learning approach based on deep metric learning. For the empirical study, we investigate the efficacy of the proposed policy metrics and representations in characterizing policy difference and conveying policy generalization, respectively. Our experiments are conducted in both policy optimization and evaluation problems, covering trust-region policy optimization (TRPO), diversity-guided evolution strategy (DGES) and off-policy evaluation (OPE). Somewhat naturally, the experimental results indicate that there is no universally optimal abstraction for all downstream learning problems, while the influence-irrelevance policy abstraction can be a generally preferred choice.
Machine Learning, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The core problem this paper attempts to solve is how to effectively represent and optimize policies in Markov Decision Processes (MDPs). Specifically, the paper focuses on policy representation and optimization in intelligent decision-making systems. The policy space is large and highly complex, especially in real-world scenarios, which makes policy learning difficult. The paper therefore proposes a unified policy abstraction theory and a policy representation learning method based on deep metric learning to improve policy learning, evaluation, and optimization.

### Main contributions of the paper

1. **Unified policy abstraction theory**:
   - Three types of policy abstraction are proposed: distribution-irrelevance abstraction, influence-irrelevance abstraction, and value-irrelevance abstraction. Each aggregates policies by a different criterion.
   - These abstractions are generalized into three policy metrics that quantify the distance (i.e., similarity) between policies, which makes them more convenient to use in policy representation learning.
2. **Policy representation learning method based on deep metric learning**:
   - An alignment loss is proposed: policy representations are learned by minimizing the difference between the distances of policy embeddings and the corresponding policy metric values.
   - Maximum Mean Discrepancy (MMD) is used to efficiently estimate the policy metrics, and the Layer-wise Permutation-invariant Encoder (LPE) is adopted for structure-aware encoding.
3. **Experimental verification**:
   - Extensive experiments are carried out on policy optimization and evaluation problems, including Trust Region Policy Optimization (TRPO), Diversity-Guided Evolution Strategy (DGES), and Off-Policy Evaluation (OPE).
   - The experimental results show that different policy abstractions and metrics perform differently across downstream tasks, and that no abstraction is universally optimal; in general, however, the influence-irrelevance abstraction may be a better choice.

### Definition of policy abstraction

According to the paper, a policy abstraction is a mapping from the original policy space to an abstract space. The specific definitions are as follows:

- **Distribution-irrelevance abstraction (fπ)**: If two policies are mapped to the same abstract representation, then they have the same action distribution in every state.
  \[ f_{\pi}(\pi_i) = f_{\pi}(\pi_j) \implies \pi_i(a|s) = \pi_j(a|s), \forall s \in S, a \in A \]
- **Influence-irrelevance abstraction (fPπ)**: If two policies are mapped to the same abstract representation, then they induce the same state-transition distribution in every state.
  \[ f_{P\pi}(\pi_i) = f_{P\pi}(\pi_j) \implies P_{\pi_i}(s'|s) = P_{\pi_j}(s'|s), \forall s, s' \in S \]
- **Value-irrelevance abstraction (fVπ)**: If two policies are mapped to the same abstract representation, then they have the same value function in every state.
  \[ f_{V\pi}(\pi_i) = f_{V\pi}(\pi_j) \implies V_{\pi_i}(s) = V_{\pi_j}(s), \forall s \in S \]

### Definition of policy metrics

To quantify the similarity between policies, the paper defines three policy metrics:

- **Distribution-irrelevance metric (dπ)**:
  \[ d_{\pi}(\pi_i, \pi_j) = \mathbb{E}_{s \sim p(s)}[D