Tensor Low-rank Approximation of Finite-horizon Value Functions

Sergio Rozada, Antonio G. Marques
2024-05-28
Abstract: The goal of reinforcement learning is estimating a policy that maps states to actions and maximizes the cumulative reward of a Markov Decision Process (MDP). This is oftentimes achieved by estimating first the optimal (reward) value function (VF) associated with each state-action pair. When the MDP has an infinite horizon, the optimal VFs and policies are stationary under mild conditions. However, in finite-horizon MDPs, the VFs (hence, the policies) vary with time. This poses a challenge since the number of VFs to estimate grows not only with the size of the state-action space but also with the time horizon. This paper proposes a non-parametric low-rank stochastic algorithm to approximate the VFs of finite-horizon MDPs. First, we represent the (unknown) VFs as a multi-dimensional array, or tensor, where time is one of the dimensions. Then, we use rewards sampled from the MDP to estimate the optimal VFs. More precisely, we use the (truncated) PARAFAC decomposition to design an online low-rank algorithm that recovers the entries of the tensor of VFs. The size of the low-rank PARAFAC model grows additively with respect to each of its dimensions, rendering our approach efficient, as demonstrated via numerical experiments.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to efficiently estimate the optimal value functions (VFs) of a finite-horizon Markov decision process (MDP). Specifically:

1. **Problem Background**:
   - In an infinite-horizon MDP, the optimal policy and VFs are stationary (time-independent) under mild conditions, which makes them relatively easy to estimate.
   - In a finite-horizon MDP, the policy and VFs vary with time. The number of VFs to estimate therefore grows not only with the size of the state-action space but also with the length of the time horizon. This is an instance of the "curse of dimensionality" and makes estimation considerably harder.
2. **Limitations of Existing Methods**:
   - For the infinite-horizon problem, many works have proposed effective VF approximation methods, for example based on neural networks (NNs) and linear models.
   - For the finite-horizon problem, existing VF approximation methods are less mature and less efficient.
3. **The Method Proposed in the Paper**:
   - The paper introduces a method based on low-rank tensor decomposition, specifically the PARAFAC decomposition, to approximate the VFs of a finite-horizon MDP.
   - The VFs are represented as a multi-dimensional array (tensor) in which time is one of the dimensions, and rewards sampled from the MDP are used to estimate them.
   - An online low-rank algorithm based on the truncated PARAFAC decomposition recovers the entries of the VF tensor.
4. **Main Contributions**:
   - The size of the PARAFAC model grows additively with each tensor dimension, which reduces the number of parameters to estimate and alleviates the curse of dimensionality.
   - The method is evaluated through numerical experiments in two environments, a grid world and a wireless communication setup, where it is reported to be efficient and accurate.
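To make the parameter-count argument concrete, the PARAFAC model can be written out for the simplest case of a three-way tensor indexed by time, state, and action. The notation here is an assumption chosen for illustration and is not taken from the paper: $H$ is the horizon, $|\mathcal{S}|$ and $|\mathcal{A}|$ are the numbers of states and actions, $R$ is the rank, and $\mathbf{T}$, $\mathbf{S}$, $\mathbf{A}$ are the factor matrices of the three dimensions.

$$
\hat{Q}[h, s, a] = \sum_{r=1}^{R} \mathbf{T}[h, r]\,\mathbf{S}[s, r]\,\mathbf{A}[a, r],
\qquad
\underbrace{\left(H + |\mathcal{S}| + |\mathcal{A}|\right) R}_{\text{PARAFAC parameters}}
\quad \text{versus} \quad
\underbrace{H\,|\mathcal{S}|\,|\mathcal{A}|}_{\text{full tensor entries}} .
$$

The low-rank count is additive in the dimension sizes, whereas the full tensor is multiplicative. If the state or action space is itself multi-dimensional, each of its dimensions would contribute its own factor matrix and its own additive term.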
In summary, this paper aims to solve the high-complexity problem of value function estimation in finite-horizon MDPs and proposes an efficient estimation method based on low-rank tensor decomposition.
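The idea of estimating a time-indexed VF tensor through its PARAFAC factors can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' algorithm: the toy MDP, the sizes `H`, `S`, `A`, `R`, the step size, the uniform exploration, and the helper names `q_value`, `q_row`, and `td_step` are all hypothetical, and the update is a generic Q-learning-style stochastic gradient step on the factor rows touched by each sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: horizon H, number of states S, actions A, PARAFAC rank R.
H, S, A, R = 5, 10, 3, 2

# Toy MDP assumed for illustration: random transition kernel and rewards.
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] is a distribution over next states
rewards = rng.uniform(0.0, 1.0, size=(S, A))

# One factor matrix per tensor dimension (time, state, action).
factors = [rng.normal(0.5, 0.1, size=(n, R)) for n in (H, S, A)]

full_params = H * S * A              # entries of the full tensor (multiplicative)
lowrank_params = (H + S + A) * R     # entries of the PARAFAC factors (additive)


def q_value(factors, h, s, a):
    """Entry (h, s, a) of the PARAFAC tensor: sum over r of the product of factor rows."""
    T, X, U = factors
    return float(np.sum(T[h] * X[s] * U[a]))


def q_row(factors, h, s):
    """Values of all actions at time h and state s."""
    T, X, U = factors
    return U @ (T[h] * X[s])


def td_step(factors, h, s, a, r, s_next, lr=0.02):
    """One stochastic update of the three factor rows touched by the sample.

    The target bootstraps from the next time step and is just the reward
    at the last step of the horizon.
    """
    T, X, U = factors
    bootstrap = 0.0 if h == H - 1 else float(np.max(q_row(factors, h + 1, s_next)))
    err = r + bootstrap - q_value(factors, h, s, a)
    t, x, u = T[h].copy(), X[s].copy(), U[a].copy()   # freeze rows before updating
    T[h] += lr * err * x * u
    X[s] += lr * err * t * u
    U[a] += lr * err * t * x
    return err


# Online estimation from sampled transitions with uniform random exploration.
for episode in range(500):
    s = rng.integers(S)
    for h in range(H):
        a = rng.integers(A)
        s_next = rng.choice(S, p=P[s, a])
        td_step(factors, h, s, a, rewards[s, a], s_next)
        s = s_next

print(f"full tensor: {full_params} entries, PARAFAC factors: {lowrank_params} entries")
```

A greedy time-dependent policy can then be read off as `np.argmax(q_row(factors, h, s))` for each time step `h` and state `s`. Only three rows of the factor matrices change per sample, which is what keeps each online update cheap in this sketch.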