Abstract: We study the Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL), which extends the conventional value function approximator (VFA) to take as input not only the state (and action) but also an explicit policy representation. This extension enables PeVFA to preserve the values of multiple policies at the same time and brings an appealing characteristic: value generalization among policies. We formally analyze value generalization under Generalized Policy Iteration (GPI). Through theoretical and empirical lenses, we show that the generalized value estimates offered by PeVFA may have lower initial approximation error to the true values of successive policies, which is expected to improve consecutive value approximation during GPI. Based on these findings, we introduce a new form of GPI with PeVFA that leverages value generalization along the policy improvement path. Moreover, we propose a representation learning framework for RL policies, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs. In our experiments, we evaluate the efficacy of the value generalization offered by PeVFA and of policy representation learning in several OpenAI Gym continuous control tasks. As a representative algorithm implementation, Proximal Policy Optimization (PPO) re-implemented under the paradigm of GPI with PeVFA achieves about 40% performance improvement over its vanilla counterpart in most environments.

What problem does this paper attempt to address?

This paper asks how to make value function approximation (VFA) in reinforcement learning (RL) more efficient. Its answer is the policy-extended value function approximator (PeVFA), which generalizes value estimates along the policy improvement path. Specifically:

1. **Limitations of value function approximation**:
   - In traditional RL algorithms, a value function approximator (VFA) usually approximates the value of only one policy at a time. As learning progresses, the value information of earlier policies is gradually overwritten, so previously learned knowledge cannot be retained or reused.
   - As a result, each policy improvement step starts from value estimates fitted to the previous policy, and the resulting approximation error degrades the overall optimization.
2. **Introduction of PeVFA**:
   - To address this, the paper proposes PeVFA, which takes an explicit policy representation as input in addition to the state (and action).
   - PeVFA can therefore retain the values of multiple policies simultaneously and generalize values across policies. This may reduce the initial approximation error with respect to the true values of successive policies, which in turn is expected to improve consecutive value approximation.
3. **Policy representation learning**:
   - The paper also proposes a policy representation learning framework that learns effective low-dimensional embeddings from policy network parameters or state-action pairs. These embeddings help PeVFA generalize value estimates.
4. **Theoretical and empirical analysis**:
   - From both theoretical and empirical perspectives, the paper analyzes PeVFA under generalized policy iteration (GPI) and shows that its value generalization can hold along the policy improvement path.
   - In OpenAI Gym continuous control tasks, Proximal Policy Optimization (PPO) re-implemented with PeVFA (PPO-PeVFA) achieves about 40% performance improvement over vanilla PPO in most environments.

In summary, the paper aims to improve the efficiency and generalization of value function approximation in RL through PeVFA and policy representation learning, and thereby to improve learning during policy improvement.
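The idea of a lower initial approximation error can be illustrated with a small toy sketch. It is not the paper's implementation. Everything in it is an assumption made for illustration: policies are reduced to a single scalar `theta` that stands in for a learned policy embedding, the closed-form `true_value` is invented, and the approximators are least-squares fits over hand-picked quadratic features rather than neural networks. The toy is constructed so that the true value is exactly representable by the PeVFA features, so it shows the best case rather than the typical one.

```python
import numpy as np

def true_value(s, theta):
    # Hypothetical closed-form value of policy `theta` at state `s` (toy assumption).
    return -(s - theta) ** 2

def pevfa_features(s, theta):
    # PeVFA input: state AND policy representation (here, quadratic features of both).
    return np.stack([np.ones_like(s), s, theta, s**2, s * theta, theta**2], axis=-1)

def vfa_features(s):
    # Conventional VFA input: state only.
    return np.stack([np.ones_like(s), s, s**2], axis=-1)

states = np.linspace(-1.0, 1.0, 11)
past_thetas = [0.0, 0.1, 0.2]   # policies already evaluated along the improvement path
next_theta = 0.3                # successive policy, not yet evaluated

# PeVFA: a single approximator fitted on the values of all past policies.
S = np.concatenate([states for _ in past_thetas])
T = np.concatenate([np.full_like(states, th) for th in past_thetas])
w_pevfa, *_ = np.linalg.lstsq(pevfa_features(S, T), true_value(S, T), rcond=None)

# Conventional VFA: only holds the values of the most recent policy.
w_vfa, *_ = np.linalg.lstsq(
    vfa_features(states), true_value(states, past_thetas[-1]), rcond=None
)

# Initial estimates for the successive policy, before any further value updates.
target = true_value(states, next_theta)
pevfa_init = pevfa_features(states, np.full_like(states, next_theta)) @ w_pevfa
vfa_init = vfa_features(states) @ w_vfa

pevfa_err = np.abs(pevfa_init - target).mean()
vfa_err = np.abs(vfa_init - target).mean()
print(f"initial error, PeVFA: {pevfa_err:.2e}")
print(f"initial error, conventional VFA: {vfa_err:.2e}")
```

In this constructed case the PeVFA estimate for the unseen policy starts much closer to its true value than the conventional estimate carried over from the previous policy. With learned embeddings and function approximation error, the paper's claim is the weaker one that the initial error may be lower.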

What About Inputing Policy in Value Function: Policy Representation and Policy-extended Value Function Approximator

Represent Your Own Policies: Reinforcement Learning with Policy-extended Value Function Approximator

Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction.

General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States

Provably Efficient Reinforcement Learning with General Value Function Approximation.

Policy Iteration Approximate Dynamic Programming Using Volterra Series Based Actor

Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance

Efficient Deep Reinforcement Learning Through Policy Transfer.

Efficient Deep Reinforcement Learning Via Adaptive Policy Transfer

Generalised Policy Improvement with Geometric Policy Composition

Policy Filtration in RLHF to Fine-Tune LLM for Code Generation

Inverse Policy Evaluation for Value-based Sequential Decision-making

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Linear Function Approximation as a Computationally Efficient Method to Solve Classical Reinforcement Learning Challenges

PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning

GRAPE: Generalizing Robot Policy via Preference Alignment

Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

Imagined Value Gradients: Model-Based Policy Optimization with Transferable Latent Dynamics Models

Blending Imitation and Reinforcement Learning for Robust Policy Improvement

IV-Posterior: Inverse Value Estimation for Interpretable Policy Certificates