Dhawal Gupta, Scott M. Jordan, Shreyas Chaudhari, Bo Liu, Philip S. Thomas, Bruno Castro da Silva
Abstract: In this paper, we introduce a fresh perspective on the challenges of credit assignment and policy evaluation. First, we delve into the nuances of eligibility traces and explore instances where their updates may result in unexpected credit assignment to preceding states. From this investigation emerges the concept of a novel value function, which we refer to as the \emph{bidirectional value function}. Unlike traditional state value functions, bidirectional value functions account for both future expected returns (rewards anticipated from the current state onward) and past expected returns (cumulative rewards from the episode's start to the present). We derive principled update equations to learn this value function and, through experimentation, demonstrate its efficacy in enhancing the process of policy evaluation. In particular, our results indicate that the proposed learning approach can, in certain challenging contexts, perform policy evaluation more rapidly than TD($\lambda$) -- a method that learns forward value functions, $v^\pi$, \emph{directly}. Overall, our findings present a new perspective on eligibility traces and potential advantages associated with the novel value function it inspires, especially for policy evaluation.
What problem does this paper attempt to address?
The paper addresses credit assignment in reinforcement learning, and in particular the unexpected credit assignment that TD(λ) can produce when policy evaluation uses non-linear function approximators such as neural networks. It asks how TD(λ) can be improved so that its updates to the values of previously visited states match the intended behavior under such approximators.
### Main Contributions
1. **Proposing a New Perspective**: The paper re-examines eligibility traces and identifies specific scenarios in which their updates can assign unexpected credit to previous states.
2. **Introducing the Bidirectional Value Function**: The paper introduces a new value function, the bidirectional value function, which accounts for both future expected returns and past expected returns and so gives a broader view than the traditional state value function.
3. **Deriving the Update Equation**: The paper derives principled update equations for learning the bidirectional value function and emphasizes that they are practical to apply.
4. **Experimental Verification**: Experiments indicate that, in certain challenging settings, learning the bidirectional value function can perform policy evaluation more rapidly than TD(λ), especially with complex non-linear approximators.
### Background and Motivation
- **Reinforcement Learning Framework**: The paper first introduces the basic concepts of reinforcement learning and the definition of Markov decision processes (MDP).
- **Policy Evaluation**: The goal of policy evaluation is to predict the future returns of a given policy; it is an important subroutine of policy improvement.
- **Problems with the TD(λ) Method**: With non-linear function approximators, the eligibility trace in TD(λ) stores gradients that were computed under earlier parameter values. Because those stored gradients are outdated, the resulting update direction can be inconsistent, and the values of previous states may change in ways that do not match the intended credit assignment.
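
The stale-gradient issue described in the last bullet can be illustrated with a minimal sketch. It is not the paper's code: the function name `td_lambda_traces`, the `tanh` approximator, the toy features, and the step sizes are all assumptions chosen for illustration. The sketch runs semi-gradient TD(λ) with an accumulating trace and compares the trace actually used with a "fresh" trace whose gradients are all recomputed at the current weights.

```python
import numpy as np

def td_lambda_traces(w0, feats, rewards, nonlinear, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of semi-gradient TD(lambda); return (stale, fresh) traces.

    feats[t] is the (hypothetical) feature vector of the state visited at step t,
    and the episode terminates after the last reward.
    """
    def v(w, x):
        return np.tanh(w @ x) if nonlinear else w @ x

    def grad(w, x):
        # Gradient of v(w, x) with respect to w.
        return (1.0 - np.tanh(w @ x) ** 2) * x if nonlinear else x

    w = np.array(w0, dtype=float)
    z = np.zeros_like(w)
    fresh = np.zeros_like(w)
    T = len(rewards)
    for t in range(T):
        x = feats[t]
        v_next = 0.0 if t == T - 1 else v(w, feats[t + 1])
        delta = rewards[t] + gamma * v_next - v(w, x)
        # Accumulating trace: older terms keep gradients taken at older weights.
        z = gamma * lam * z + grad(w, x)
        # Reference trace: every past gradient re-evaluated at the current weights.
        fresh = sum((gamma * lam) ** (t - k) * grad(w, feats[k]) for k in range(t + 1))
        w = w + alpha * delta * z
    return z, fresh
```

Under these assumptions, the two traces coincide for a linear approximator (the gradient does not depend on the weights) and differ for the `tanh` approximator, which is one concrete way to see what "outdated gradient memories" means.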
### Methodology
- **Definition of the Bidirectional Value Function**: The paper defines the bidirectional value function (←→v), which is the sum of the forward value function (→v) and the backward value function (←v).
- **Bellman Equation**: The paper derives the Bellman equations for the bidirectional value function and the backward value function and proves that the Bellman operators corresponding to these equations are contraction mappings, ensuring convergence in the tabular setting.
- **Online Incremental Update Equation**: The paper derives online incremental update equations for learning the bidirectional value function and the backward value function; each update has a fixed computational cost per step.
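
To make the forward, backward, and bidirectional quantities concrete, here is a small sketch that computes the corresponding sample returns along one trajectory. The value functions are expectations of such returns. The indexing convention is an assumption made for this example (the reward at step t is counted in the forward return, and past rewards are discounted by their distance from the present); the paper's exact definitions may differ.

```python
def forward_returns(rewards, gamma):
    """G[t] = sum over k >= t of gamma**(k - t) * rewards[k]."""
    returns = [0.0] * len(rewards)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        returns[t] = acc
    return returns

def backward_returns(rewards, gamma):
    """B[t] = sum over k < t of gamma**(t - k) * rewards[k] (assumed convention)."""
    returns = []
    acc = 0.0
    for r in rewards:
        returns.append(acc)
        # Forward-in-time recursion: B[t + 1] = gamma * (B[t] + rewards[t]).
        acc = gamma * (acc + r)
    return returns

def bidirectional_returns(rewards, gamma):
    """Sum of the forward and backward returns at every step."""
    fwd = forward_returns(rewards, gamma)
    bwd = backward_returns(rewards, gamma)
    return [f + b for f, b in zip(fwd, bwd)]
```

For example, `bidirectional_returns([1.0, 2.0, 4.0], 1.0)` gives the total episode return, 7.0, at every step, since with no discounting the past and future rewards always add up to the same sum. The backward recursion also shows why the backward quantity can be tracked online at constant cost per step.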
### Experiments
- **Parameterization**: The paper explores how to parameterize the three value functions (←→v, ←v, →v) so that learning one of them also helps learn the other two.
- **Policy Evaluation Performance**: The paper evaluates these value functions on standard prediction tasks, in particular the chain domain, and reports that the bidirectional value function improves policy evaluation performance there.
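
The summary does not say which parameterization the paper adopts, so the following is only one hypothetical option, sketched to show how the identity ←→v = →v + ←v can couple the three functions. The class `SharedValueHeads`, its linear heads, and the update rule are assumptions for illustration: the bidirectional and backward values get their own weights, and the forward value is derived as their difference, so an error in the forward estimate adjusts both sets of weights.

```python
import numpy as np

class SharedValueHeads:
    """Hypothetical coupling: forward value = bidirectional value - backward value."""

    def __init__(self, n_features, seed=0):
        rng = np.random.default_rng(seed)
        self.w_bi = rng.normal(scale=0.1, size=n_features)    # bidirectional head
        self.w_back = rng.normal(scale=0.1, size=n_features)  # backward head

    def bidirectional(self, x):
        return self.w_bi @ x

    def backward(self, x):
        return self.w_back @ x

    def forward(self, x):
        # Derived rather than learned directly.
        return self.bidirectional(x) - self.backward(x)

    def update_forward(self, x, target, alpha):
        # Gradient step on the squared forward error; it moves both heads.
        error = target - self.forward(x)
        self.w_bi += alpha * error * x
        self.w_back -= alpha * error * x
```

Other choices are equally possible, such as learning the forward and backward heads and deriving the bidirectional value as their sum; which option works best is an empirical question the paper's experiments address.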
Through these contributions, the paper offers a new perspective on credit assignment in reinforcement learning and proposes an approach aimed at the issues that arise with complex non-linear function approximators.