What About Inputing Policy in Value Function: Policy Representation and Policy-extended Value Function Approximator

Hongyao Tang, Zhaopeng Meng, Jianye Hao, Chen Chen, Daniel Graves, Dong Li, Changmin Yu, Hangyu Mao, Wulong Liu, Yaodong Yang, Wenyuan Tao, Li Wang
DOI: https://doi.org/10.48550/arXiv.2010.09536
2021-12-16
Abstract: We study Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL), which extends conventional value function approximator (VFA) to take as input not only the state (and action) but also an explicit policy representation. Such an extension enables PeVFA to preserve values of multiple policies at the same time and brings an appealing characteristic, i.e., *value generalization among policies*. We formally analyze the value generalization under Generalized Policy Iteration (GPI). From theoretical and empirical lens, we show that generalized value estimates offered by PeVFA may have lower initial approximation error to true values of successive policies, which is expected to improve consecutive value approximation during GPI. Based on above clues, we introduce a new form of GPI with PeVFA which leverages the value generalization along policy improvement path. Moreover, we propose a representation learning framework for RL policy, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs. In our experiments, we evaluate the efficacy of value generalization offered by PeVFA and policy representation learning in several OpenAI Gym continuous control tasks. For a representative instance of algorithm implementation, Proximal Policy Optimization (PPO) re-implemented under the paradigm of GPI with PeVFA achieves about 40% performance improvement on its vanilla counterpart in most environments.
Machine Learning,Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to make value function approximation (VFA) in reinforcement learning (RL) more efficient. It introduces the policy-extended value function approximator (PeVFA) so that value estimates generalize along the policy improvement path. Specifically:

1. **Limitations of value function approximation**:
   - In conventional RL algorithms, a value function approximator (VFA) usually approximates the value of only one policy at a time. As learning progresses, the value information of earlier policies is gradually overwritten, so previously learned knowledge cannot be retained or reused.
   - This limitation makes value approximation error hard to avoid during policy improvement, which in turn degrades the overall optimization.
2. **Introduction of PeVFA**:
   - To address these problems, the paper proposes PeVFA, which takes as input not only the state (and action) but also an explicit policy representation.
   - PeVFA can therefore preserve the values of multiple policies at the same time and generalize values among policies. The generalized estimates may have lower initial approximation error with respect to the true values of successive policies, which is expected to improve consecutive value approximation.
3. **Policy representation learning**:
   - The paper also proposes a policy representation learning framework that learns effective low-dimensional embeddings from policy network parameters or state-action pairs. These embeddings help PeVFA generalize value estimates.
4. **Theoretical and empirical analysis**:
   - From both theoretical and empirical perspectives, the paper analyzes value generalization with PeVFA under Generalized Policy Iteration (GPI) and shows that it can hold along the policy improvement path.
   - In experiments on OpenAI Gym continuous control tasks, Proximal Policy Optimization (PPO) re-implemented with PeVFA (called PPO-PeVFA) achieves about 40% performance improvement over vanilla PPO in most environments.

In summary, the main objective of the paper is to improve the efficiency and generalization of value function approximation in reinforcement learning by introducing PeVFA and policy representation learning, and thereby to improve learning during policy improvement.
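
The core idea, a value function that takes a policy representation as an extra input, can be illustrated with a short Python sketch. This is a minimal, untrained illustration under stated assumptions, not the authors' implementation. The linear policies, the mean-pooled embedding of state-action pairs, the one-hidden-layer value network, and every name (`linear_policy`, `embed_policy`, `pevfa_value`, `random_matrix`) are hypothetical choices made only to show the interface.

```python
import math
import random


def linear_policy(K):
    """Deterministic linear policy a = K s (illustrative stand-in for a policy network)."""
    def act(state):
        return [sum(k * s for k, s in zip(row, state)) for row in K]
    return act


def embed_policy(policy, probe_states, W_embed):
    """Hypothetical policy representation chi_pi: mean-pooled tanh features
    of (state, action) pairs collected on a fixed set of probe states."""
    pooled = [0.0] * len(W_embed)
    for s in probe_states:
        pair = list(s) + policy(s)
        for i, row in enumerate(W_embed):
            pooled[i] += math.tanh(sum(w * x for w, x in zip(row, pair)))
    return [p / len(probe_states) for p in pooled]


def pevfa_value(state, chi, params):
    """PeVFA-style estimate V(s, chi_pi): one hidden tanh layer over [s ; chi_pi]."""
    W1, b1, w2, b2 = params
    x = list(state) + list(chi)
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def random_matrix(rng, rows, cols, scale=0.5):
    """Random Gaussian weights; the networks here are untrained on purpose."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
STATE_DIM, ACTION_DIM, EMBED_DIM, HIDDEN_DIM = 3, 2, 4, 8

probe_states = random_matrix(rng, 16, STATE_DIM, scale=1.0)
W_embed = random_matrix(rng, EMBED_DIM, STATE_DIM + ACTION_DIM)
params = (
    random_matrix(rng, HIDDEN_DIM, STATE_DIM + EMBED_DIM),  # W1
    [0.0] * HIDDEN_DIM,                                     # b1
    [rng.gauss(0.0, 0.5) for _ in range(HIDDEN_DIM)],       # w2
    0.0,                                                    # b2
)

# Two different policies share ONE value network; only the embedding changes.
K_a = random_matrix(rng, ACTION_DIM, STATE_DIM)
K_b = [[-k for k in row] for row in K_a]
chi_a = embed_policy(linear_policy(K_a), probe_states, W_embed)
chi_b = embed_policy(linear_policy(K_b), probe_states, W_embed)

state = [0.1, -0.2, 0.3]
v_a = pevfa_value(state, chi_a, params)
v_b = pevfa_value(state, chi_b, params)
print(f"V(s, chi_a) = {v_a:.3f}, V(s, chi_b) = {v_b:.3f}")
```

Because the weights are random, the printed values carry no meaning. The point is structural: a conventional VFA would need to be refit for each policy, whereas here the same parameters can be queried for any policy by swapping the embedding. That is what allows values to be preserved for multiple policies and generalized to new ones.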