Abstract: Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. TD algorithms with linear function approximation are shown to be convergent when the samples are generated from the target policy (known as on-policy prediction). However, it has been well established in the literature that off-policy TD algorithms under linear function approximation diverge. In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation. The main idea is to penalize the updates of the algorithm in a way as to ensure convergence of the iterates. We provide a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the effectiveness of our algorithm.

What problem does this paper attempt to address?

This paper addresses the problem of off-policy prediction in Reinforcement Learning (RL): estimating the value function of a target policy from data generated by a different behavior policy. The standard off-policy Temporal-Difference (TD) algorithm may diverge under linear function approximation. The paper therefore proposes a new on-line off-policy TD algorithm that ensures convergence by introducing a penalty term.

### Main contributions of the paper:

1. **Propose a new on-line off-policy TD algorithm**: The algorithm applies a penalty in each iteration to keep the parameter updates stable.
2. **Prove the convergence of the algorithm**: Using existing theoretical tools, the paper proves that the new algorithm converges under specific conditions.
3. **Conduct numerical experiments**: Experiments in standard benchmark environments demonstrate the effectiveness of the new algorithm.

### Background and problem definition:

- **Markov Decision Process (MDP)**: The paper considers an MDP of the form \((S, U, p, r, \gamma)\), where \(S\) is the state space, \(U\) is the action set, \(p\) is the transition probability matrix, \(r\) is the immediate reward function, and \(\gamma\) is the discount factor.
- **Target Policy**: Denoted \(\pi\); its value function \(V^\pi\) is the quantity to be estimated.
- **Behavior Policy**: Denoted \(\mu\); it generates the data samples.
- **Linear approximation of the value function**: The value function is approximated as \(V(s)\approx\theta^T\phi(s)\), where \(\phi(s)\) is the feature vector of state \(s\) and \(\theta\) is the weight vector.
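The setup above can be illustrated with a small sketch. Everything numeric here is hypothetical (a 3-state, 2-action problem with made-up features and policies), and the helper names `approx_value` and `importance_ratio` are not from the paper. The ratio \(\pi(a|s)/\mu(a|s)\) is the standard importance sampling ratio used to correct for sampling actions from \(\mu\) instead of \(\pi\).

```python
import numpy as np

# Hypothetical 3-state, 2-action setup; all numbers are illustrative only.
Phi = np.array([[1.0, 0.0],    # phi(s0)
                [0.0, 1.0],    # phi(s1)
                [1.0, 1.0]])   # phi(s2)
theta = np.array([0.5, -0.25])  # weight vector

# Rows are states, columns are actions; each row sums to 1.
pi = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])  # target policy pi(a|s)
mu = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])  # behavior policy mu(a|s)


def approx_value(theta, phi_s):
    """Linear value estimate V(s) ~ theta^T phi(s)."""
    return float(theta @ phi_s)


def importance_ratio(s, a):
    """Standard importance sampling ratio pi(a|s) / mu(a|s)."""
    return float(pi[s, a] / mu[s, a])


# Value estimates for all states at once.
V_hat = Phi @ theta
```

Stacking the feature vectors as rows of `Phi` gives all state-value estimates in one matrix-vector product, which is the same matrix \(\Phi\) that appears in convergence arguments for linear TD methods.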
### Algorithm description:

- **Update rule**: In each iteration, the algorithm computes the importance sampling ratio \(\rho_n\) and the modified temporal-difference term \(\delta_n\) from the current state, action, reward, and next-state sample, and then updates the parameter \(\theta\).
- **Penalty term**: The temporal-difference term is modified by a penalty controlled by the parameter \(\eta\), which enters through the factor \((1 + \eta)\). This modification is what ensures convergence.

### Convergence analysis:

- **Positive definite matrix**: Convergence is guaranteed by proving that the matrix \(A=\Phi^T D_\mu((1 + \eta)I-\gamma P_\pi)\Phi\) is positive definite.
- **Fixed point**: The algorithm converges to a fixed point \(\theta^*\) satisfying \(b - A\theta^*=0\).

### Experimental results:

- **Benchmark tests**: The paper runs experiments on several standard off-policy divergence counterexamples, including the "θ → 2θ" example, Baird's 7-star example, and a 3-state MDP.
- **Performance comparison**: Compared with the existing Emphatic TD(0) and TDC algorithms, the new algorithm shows better stability and convergence in these tests.

### Conclusions and future work:

- **Conclusion**: The proposed algorithm resolves the divergence problem of off-policy TD with linear function approximation and performs well on several benchmark tests.
- **Future work**: Further optimize the selection of \(\eta\), extend the algorithm to include eligibility traces, and apply the algorithm to practical problems.

Through these contributions, the paper provides a new solution to the off-policy prediction problem and lays the foundation for further research.
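To make the update rule and the positive-definiteness condition from "Algorithm description" and "Convergence analysis" more concrete, here is a minimal sketch. It rests on assumptions that the summary does not confirm: the modified temporal-difference term is taken to be \(\delta_n = r_n + \gamma\theta^T\phi(s_{n+1}) - (1+\eta)\theta^T\phi(s_n)\), a form inferred from the matrix \(A\) and not quoted from the paper, and the "θ → 2θ" example is modeled in a common textbook form (features 1 and 2, zero reward, only the first state sampled). The function names are hypothetical.

```python
import numpy as np


def penalized_td_update(theta, phi, phi_next, reward, rho, gamma, eta, alpha):
    """One assumed update step: theta + alpha * rho * delta * phi, with
    delta = reward + gamma * theta^T phi_next - (1 + eta) * theta^T phi.
    This form is inferred from the matrix A, not taken from the paper."""
    delta = reward + gamma * (theta @ phi_next) - (1.0 + eta) * (theta @ phi)
    return theta + alpha * rho * delta * phi


def run_theta_2theta(eta, gamma=0.9, alpha=0.1, steps=200):
    """Repeat the single transition phi=1 -> phi=2 with zero reward and
    rho=1 (assumed textbook form of the example); return the final theta."""
    phi, phi_next = np.array([1.0]), np.array([2.0])
    theta = np.array([1.0])
    for _ in range(steps):
        theta = penalized_td_update(theta, phi, phi_next, 0.0, 1.0,
                                    gamma, eta, alpha)
    return float(theta[0])


def key_matrix(Phi, d_mu, P_pi, gamma, eta):
    """A = Phi^T D_mu ((1 + eta) I - gamma P_pi) Phi."""
    D = np.diag(d_mu)
    n = len(d_mu)
    return Phi.T @ D @ ((1.0 + eta) * np.eye(n) - gamma * P_pi) @ Phi


def is_positive_definite(A):
    """x^T A x > 0 for all nonzero x iff the symmetric part of A has
    strictly positive eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh((A + A.T) / 2.0) > 0.0))
```

Under these assumptions, calling `run_theta_2theta(eta=0.0)` reproduces the familiar blow-up (each step multiplies θ by 1.08 when γ = 0.9), while `run_theta_2theta(eta=1.0)` shrinks θ toward zero because the per-step multiplier becomes 0.98. In this one-dimensional case \(A\) reduces to the scalar \((1+\eta) - 2\gamma\), so the choice of \(\eta\) has to be large enough relative to \(\gamma\), which is consistent with the summary listing the selection of \(\eta\) as future work.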

A Convergent Off-Policy Temporal Difference Algorithm
