Abstract: Estimating a policy that maps states to actions is a central problem in reinforcement learning. Traditionally, policies are inferred from the so-called value functions (VFs), but exact VF computation suffers from the curse of dimensionality. Policy gradient (PG) methods bypass this by directly learning a parametric stochastic policy. Typically, the parameters of the policy are estimated using neural networks (NNs) tuned via stochastic gradient descent. However, finding adequate NN architectures can be challenging, and convergence issues are common as well. In this paper, we put forth low-rank matrix-based models to efficiently estimate the parameters of PG algorithms. We collect the parameters of the stochastic policy into a matrix and then leverage matrix-completion techniques to promote (enforce) low rank. We demonstrate via numerical studies how low-rank matrix-based policy models reduce the computational and sample complexities relative to NN models, while achieving a similar aggregated reward.

What problem does this paper attempt to address?

The main problem this paper addresses is the computational and sample complexity of Policy Gradient (PG) methods in Reinforcement Learning (RL). Traditional methods usually rely on neural networks (NNs) to estimate the policy parameters, which raises the following challenges:

1. **Difficulty in architecture selection**: Finding a suitable neural network architecture is hard, and different tasks require different architectures.
2. **Convergence problems**: Neural network training often runs into convergence problems, especially in high-dimensional state spaces.
3. **Computational and sample complexity**: Neural network models usually have a large number of parameters, which leads to high computational cost and a need for large amounts of sample data.

To address these problems, the paper proposes a Low-Rank Policy Gradient (LRPG) method based on low-rank matrices. Organizing the policy parameters into matrices and using matrix-completion techniques to promote low-rank structure reduces the number of parameters and can improve the generalization ability of the model.

The main contributions of the paper are:

- **Low-rank matrix modeling**: The mean and standard deviation parameters of the policy are represented as low-rank matrices, which reduces the number of parameters and alleviates the curse of dimensionality.
- **Efficient parameter estimation**: Low-rank matrix decomposition techniques are used for parameter estimation, which reduces computational and sample complexity.
- **Experimental verification**: Experiments on three standard continuous-action reinforcement learning tasks verify the effectiveness of the LRPG method and show its advantages in parameter efficiency, convergence speed, and return.
With these changes, the LRPG method achieves cumulative rewards similar to those of neural-network-based methods while requiring fewer parameters and converging faster.
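
The low-rank idea summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a state space discretized into an `n_rows x n_cols` grid, stores the mean of a Gaussian policy as the product `L @ R.T` of two rank-`k` factors, and applies a plain REINFORCE-style update to those factors. The class name `LowRankGaussianPolicy`, its methods, the fixed standard deviation, and the learning rate are all hypothetical choices made for illustration (the paper also models the standard deviation as a low-rank matrix, which is omitted here for brevity).

```python
import numpy as np


class LowRankGaussianPolicy:
    """Gaussian policy whose per-state mean is the (i, j) entry of L @ R.T.

    Hypothetical illustration: states are indexed by a grid cell (i, j).
    """

    def __init__(self, n_rows, n_cols, rank, sigma=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        # Two thin factors replace a full n_rows x n_cols matrix of means.
        self.L = 0.1 * self.rng.standard_normal((n_rows, rank))
        self.R = 0.1 * self.rng.standard_normal((n_cols, rank))
        self.sigma = sigma  # assumption: fixed standard deviation

    def mean(self, i, j):
        # Entry (i, j) of the low-rank mean matrix.
        return float(self.L[i] @ self.R[j])

    def sample(self, i, j):
        # Draw an action a ~ N(mean(i, j), sigma^2).
        return self.mean(i, j) + self.sigma * self.rng.standard_normal()

    def grad_log_prob(self, i, j, action):
        # d/d(mu) log N(a; mu, sigma^2) = (a - mu) / sigma^2,
        # then the chain rule through mu = L[i] . R[j].
        scale = (action - self.mean(i, j)) / self.sigma**2
        return scale * self.R[j], scale * self.L[i]

    def reinforce_step(self, i, j, action, ret, lr=0.01):
        # Stochastic gradient ascent on ret * log pi(action | state).
        grad_L, grad_R = self.grad_log_prob(i, j, action)
        self.L[i] += lr * ret * grad_L
        self.R[j] += lr * ret * grad_R

    def num_params(self):
        return self.L.size + self.R.size


if __name__ == "__main__":
    policy = LowRankGaussianPolicy(n_rows=50, n_cols=40, rank=3)
    print("low-rank parameters:", policy.num_params())  # (50 + 40) * 3 = 270
    print("full-matrix parameters:", 50 * 40)            # 2000
    action = policy.sample(2, 5)
    policy.reinforce_step(2, 5, action, ret=1.0)
```

In this sketch the parameter count grows with `(n_rows + n_cols) * rank` rather than `n_rows * n_cols`, which is the source of the parameter savings that the summary attributes to the low-rank model. Each update touches only one row of `L` and one row of `R`, yet it changes the mean for every state sharing that row or column index.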

Matrix Low-Rank Approximation For Policy Gradient Methods

Matrix Low-Rank Trust Region Policy Optimization

Tensor and Matrix Low-Rank Value-Function Approximation in Reinforcement Learning

Matrix Estimation for Offline Reinforcement Learning with Low-Rank Structure

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Stochastic Cubic-Regularized Policy Gradient Method

Tensor Low-rank Approximation of Finite-horizon Value Functions

Stochastic Variance-Reduced Policy Gradient

Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation

Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation

Model-free Policy Learning with Reward Gradients

Linear Function Approximation as a Computationally Efficient Method to Solve Classical Reinforcement Learning Challenges

Hessian Aided Policy Gradient

Policy Gradient for Rectangular Robust Markov Decision Processes

Factored Policy Gradients: Leveraging Structure for Efficient Learning in MOMDPs

Efficient sample reuse in policy gradients with parameter-based exploration

Elementary Analysis of Policy Gradient Methods

Learning Optimal Deterministic Policies with Stochastic Policy Gradients

Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning