Abstract: The goal of reinforcement learning is estimating a policy that maps states to actions and maximizes the cumulative reward of a Markov Decision Process (MDP). This is oftentimes achieved by first estimating the optimal (reward) value function (VF) associated with each state-action pair. When the MDP has an infinite horizon, the optimal VFs and policies are stationary under mild conditions. However, in finite-horizon MDPs, the VFs (hence, the policies) vary with time. This poses a challenge since the number of VFs to estimate grows not only with the size of the state-action space but also with the time horizon. This paper proposes a non-parametric low-rank stochastic algorithm to approximate the VFs of finite-horizon MDPs. First, we represent the (unknown) VFs as a multi-dimensional array, or tensor, where time is one of the dimensions. Then, we use rewards sampled from the MDP to estimate the optimal VFs. More precisely, we use the (truncated) PARAFAC decomposition to design an online low-rank algorithm that recovers the entries of the tensor of VFs. The size of the low-rank PARAFAC model grows additively with respect to each of its dimensions, rendering our approach efficient, as demonstrated via numerical experiments.
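
To make the "grows additively" claim concrete, the block below writes out a rank-$R$ PARAFAC model and its parameter count. The notation is an assumption made for illustration and is not taken from the paper: $H$ is the horizon, $\mathcal{S}$ and $\mathcal{A}$ are treated as a single state dimension and a single action dimension, and $\mathbf{T}$, $\mathbf{S}$, $\mathbf{A}$ are hypothetical names for the time, state, and action factor matrices.

```latex
% Rank-R PARAFAC model of the value-function tensor (illustrative notation)
[\mathcal{Q}]_{h,s,a} \;\approx\; \sum_{r=1}^{R} [\mathbf{T}]_{h,r}\,[\mathbf{S}]_{s,r}\,[\mathbf{A}]_{a,r},
\qquad \mathbf{T}\in\mathbb{R}^{H\times R},\;
\mathbf{S}\in\mathbb{R}^{|\mathcal{S}|\times R},\;
\mathbf{A}\in\mathbb{R}^{|\mathcal{A}|\times R}.

% Parameters to estimate: additive in the dimensions, versus multiplicative for the full tensor
\underbrace{(H + |\mathcal{S}| + |\mathcal{A}|)\,R}_{\text{PARAFAC model}}
\quad\text{versus}\quad
\underbrace{H\,|\mathcal{S}|\,|\mathcal{A}|}_{\text{full tensor}}.

% Example: H = 10, |S| = 100, |A| = 10, R = 5
(10 + 100 + 10)\cdot 5 = 600
\quad\text{versus}\quad
10\cdot 100\cdot 10 = 10\,000.
```

When the state or action space itself has several dimensions, each one can be given its own mode of the tensor, and the count stays a sum over modes multiplied by $R$.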

What problem does this paper attempt to address?

This paper addresses how to efficiently estimate the optimal value functions (VFs) of a finite-horizon Markov decision process (MDP). Specifically:

1. **Problem Background**:
   - In an infinite-horizon MDP, the optimal policy and value function are stationary (time-independent) under mild conditions, which makes them relatively easy to estimate.
   - In a finite-horizon MDP, the policy and value function vary with time. The number of value functions to estimate therefore grows not only with the size of the state-action space but also with the length of the time horizon. This is an instance of the "curse of dimensionality" and makes estimation difficult.
2. **Limitations of Existing Methods**:
   - For the infinite-horizon problem, much prior work has proposed effective value function approximation methods, such as neural networks (NNs) and linear models.
   - For the finite-horizon problem, value function approximation methods are less mature and less efficient.
3. **The Method Proposed in the Paper**:
   - The paper introduces a method based on low-rank tensor decomposition, specifically the PARAFAC decomposition, to approximate the value functions of finite-horizon MDPs.
   - The value functions are represented as a multi-dimensional array (tensor) in which time is one dimension, and rewards sampled from the MDP are used to estimate them.
   - An online low-rank algorithm based on the truncated PARAFAC decomposition recovers the entries of the value function tensor.
4. **Main Contributions**:
   - The method reduces the number of parameters to estimate, because the size of the PARAFAC model grows additively rather than multiplicatively with each dimension, which alleviates the curse of dimensionality.
   - Numerical experiments in two environments, a grid-world environment and a wireless communication environment, are used to assess the efficiency and accuracy of the method.

In summary, this paper aims to reduce the high complexity of value function estimation in finite-horizon MDPs and proposes an efficient estimation method based on low-rank tensor decomposition.
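
The idea summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' algorithm: it uses a three-mode tensor (time, state, action), a plain stochastic gradient step on the squared temporal-difference error, uniformly sampled `(h, s, a)` triples, and a randomly generated toy MDP. The names `q_values`, `sgd_step`, and `run_toy_example`, as well as the step size and rank, are hypothetical.

```python
import numpy as np

def q_values(factors, h, s):
    """Q[h, s, :] for every action under the rank-R PARAFAC model."""
    T, S, A = factors
    return A @ (T[h] * S[s])  # shape: (n_actions,)

def sgd_step(factors, idx, target, lr):
    """One stochastic gradient step on 0.5 * (target - Q[h, s, a])**2.

    Only the three factor rows indexed by (h, s, a) are updated.
    Returns the error measured before the update.
    """
    T, S, A = factors
    h, s, a = idx
    t, u, v = T[h].copy(), S[s].copy(), A[a].copy()
    err = target - np.sum(t * u * v)
    T[h] += lr * err * u * v
    S[s] += lr * err * t * v
    A[a] += lr * err * t * u
    return err

def run_toy_example(H=5, n_states=6, n_actions=3, rank=2,
                    steps=5000, lr=0.01, seed=0):
    """Online low-rank Q-learning on a hypothetical random MDP."""
    rng = np.random.default_rng(seed)
    # Assumed toy MDP: P[s, a] is a distribution over next states,
    # R[s, a] is a deterministic reward in [0, 1].
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(size=(n_states, n_actions))
    # One factor matrix per tensor mode: time, state, action.
    factors = [rng.uniform(0.1, 1.0, size=(n, rank))
               for n in (H, n_states, n_actions)]
    for _ in range(steps):
        h = rng.integers(H)
        s = rng.integers(n_states)
        a = rng.integers(n_actions)
        s_next = rng.choice(n_states, p=P[s, a])
        # Finite-horizon target: no bootstrapping at the last time step.
        target = R[s, a]
        if h < H - 1:
            target += q_values(factors, h + 1, s_next).max()
        sgd_step(factors, (h, s, a), target, lr)
    return factors

if __name__ == "__main__":
    factors = run_toy_example()
    n_lowrank = sum(F.size for F in factors)
    n_full = int(np.prod([F.shape[0] for F in factors]))
    print("low-rank parameters:", n_lowrank, "full tensor entries:", n_full)
    # Greedy (time-varying) policy read off the factors: one action per (h, s).
    policy = np.array([[q_values(factors, h, s).argmax()
                        for s in range(factors[1].shape[0])]
                       for h in range(factors[0].shape[0])])
    print("greedy policy shape:", policy.shape)
```

Because time is just another mode of the tensor, a separate value function per time step never has to be stored. The greedy policy at step `h` is obtained by evaluating `q_values(factors, h, s)` and taking the maximizing action.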

Tensor Low-rank Approximation of Finite-horizon Value Functions

Tensor and Matrix Low-Rank Value-Function Approximation in Reinforcement Learning

Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

Matrix Low-Rank Approximation For Policy Gradient Methods

Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction

Value function approximation via low-rank models

Efficient Model-Free Exploration in Low-Rank MDPs

A policy gradient approach for Finite Horizon Constrained Markov Decision Processes

The Optimal Approximation Factors in Misspecified Off-Policy Value Function Estimation

Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation

Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation

Uncertainty-Aware Low-Rank Q-Matrix Estimation for Deep Reinforcement Learning

Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs

Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees

Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Matrix Estimation for Offline Reinforcement Learning with Low-Rank Structure

Overcoming the Curse of Dimensionality in Reinforcement Learning Through Approximate Factorization