Abstract: We consider the reinforcement learning (RL) problem with general utilities, which consists in maximizing a function of the state-action occupancy measure. Beyond the standard cumulative reward RL setting, this problem includes as particular cases constrained RL, pure exploration and learning from demonstrations, among others. For this problem, we propose a simpler single-loop parameter-free normalized policy gradient algorithm. Implementing a recursive momentum variance reduction mechanism, our algorithm achieves $\tilde{\mathcal{O}}(\epsilon^{-3})$ and $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexities for $\epsilon$-first-order stationarity and $\epsilon$-global optimality respectively, under adequate assumptions. We further address the setting of large finite state-action spaces via linear function approximation of the occupancy measure and show a $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity for a simple policy gradient method with a linear regression subroutine.
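To make "a function of the state-action occupancy measure" concrete, here is a minimal Python sketch. It assumes a small tabular problem and a single-trajectory discounted estimate $\lambda(s,a) \approx \sum_{t} \gamma^t \mathbb{1}\{s_t = s, a_t = a\}$; the function names, the toy trajectory and the reward table are hypothetical and not taken from the paper. It shows that the cumulative reward is linear in the occupancy estimate, whereas an exploration-style objective such as entropy is not.

```python
import numpy as np

def empirical_occupancy(states, actions, n_states, n_actions, gamma=0.9):
    """Discounted visitation counts of one trajectory (a Monte Carlo estimate)."""
    lam = np.zeros((n_states, n_actions))
    for t, (s, a) in enumerate(zip(states, actions)):
        lam[s, a] += gamma ** t
    return lam

def linear_utility(lam, reward):
    """Standard RL: the discounted return is the inner product <reward, lam>."""
    return float(np.sum(reward * lam))

def entropy_utility(lam, eps=1e-12):
    """Exploration-style utility: entropy of the normalized occupancy (nonlinear in lam)."""
    p = lam / lam.sum()
    return float(-np.sum(p * np.log(p + eps)))

# Hypothetical trajectory in a 2-state, 2-action problem (illustration only).
states = [0, 1, 1, 0]
actions = [1, 0, 0, 1]
reward = np.array([[0.0, 1.0], [2.0, 0.0]])

lam = empirical_occupancy(states, actions, n_states=2, n_actions=2, gamma=0.5)
ret = linear_utility(lam, reward)   # equals the discounted return 1 + 0.5*2 + 0.25*2 + 0.125*1
ent = entropy_utility(lam)          # nonlinear function of the same occupancy estimate
print(ret, ent)
```

Any other differentiable function of `lam` could be swapped in for `entropy_utility`; that is the sense in which the general-utility setting extends the cumulative-reward one.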

What problem does this paper attempt to address?

This paper addresses reinforcement learning with general utility functions. Specifically, the objective is to maximize a (possibly nonlinear) function of the state-action occupancy measure. This covers the standard cumulative-reward setting and also includes special cases such as constrained reinforcement learning, pure exploration, and learning from demonstrations. The paper's main objective is to propose a simpler single-loop, parameter-free normalized policy gradient algorithm. Using a recursive momentum variance reduction mechanism, the algorithm achieves sample complexities of $\tilde{\mathcal{O}}(\epsilon^{-3})$ and $\tilde{\mathcal{O}}(\epsilon^{-2})$ for reaching $\epsilon$-first-order stationary points and $\epsilon$-globally optimal solutions, respectively. The paper also studies linear function approximation of the occupancy measure for large finite state-action spaces and shows a $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity for reaching $\epsilon$-first-order stationary points with this method.

### Background and Motivation of the Paper

Traditional reinforcement learning (RL) focuses on learning a policy through interaction with the environment to maximize the expected cumulative reward. In many practical problems, however, the goal involves a more complex utility, as in risk-sensitive or risk-averse RL, constrained RL, and pure exploration. These problems usually require optimizing a nonlinear function of the state-action occupancy measure rather than a simple cumulative reward. Standard dynamic programming methods therefore no longer apply directly, and new algorithms and techniques are required.

### Main Contributions

1. **New Single-Loop Normalized Policy Gradient Algorithm (N-VR-PG)**
   - A new single-loop normalized policy gradient algorithm is proposed. Each iteration requires only one trajectory, and the algorithm needs neither prior knowledge of problem-specific parameters nor large batches or checkpoints.
   - The normalized update rule removes the need for a gradient clipping mechanism, which simplifies the algorithm design.
   - A recursive double variance reduction mechanism, combined with momentum, is applied to both the stochastic policy gradient and the occupancy measure estimator.
2. **Normalized Gradient Update Guarantees Bounded IS Weights**
   - The paper proves that the normalized gradient update automatically keeps the importance sampling (IS) weights bounded, without additional assumptions.
   - This matters particularly for large-scale problems, because controlling the variance of the IS weights is key to variance reduction.
3. **Sample Complexity Analysis**
   - In finite state-action spaces, the algorithm requires $\tilde{\mathcal{O}}(\epsilon^{-3})$ samples to reach an $\epsilon$-first-order stationary point and $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples to reach an $\epsilon$-globally optimal solution.
   - In continuous state-action spaces with a Gaussian policy, a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ can also be achieved.
4. **Linear Function Approximation in Large State-Action Spaces**
   - For large finite state-action spaces, the paper proposes approximating the occupancy measure with a linear function, fitted by a mean-squared-error (linear regression) solver.
   - This method is shown to require $\tilde{\mathcal{O}}(\epsilon^{-4})$ samples to reach an $\epsilon$-first-order stationary point.

### Related Work

- **Variance-Reduced Policy Gradients in Standard Reinforcement Learning**: In recent years, much work has been devoted to reducing the variance of policy gradients in standard RL, for example with importance sampling and momentum techniques.
- **General-Utility Reinforcement Learning**: Early research mainly focused on control problems with non-standard utilities, such as inventory problems and variance-penalized MDPs. More recent work has proposed direct policy search methods that solve general-utility RL problems through variational policy gradients.

### Conclusion

This paper addresses reinforcement learning with general utility functions by proposing a new single-loop normalized policy gradient algorithm. The algorithm combines normalization with recursive momentum variance reduction to obtain the sample complexities stated above, and it extends to large state-action spaces through linear function approximation of the occupancy measure.
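The combination of a normalized update with recursive momentum described in the contributions can be illustrated on a toy problem. The sketch below is not the paper's N-VR-PG algorithm. It assumes a simple concave objective $F(\theta) = -\tfrac{1}{2}\|\theta - \theta^\star\|^2$ with additive gradient noise that does not depend on $\theta$, so no importance sampling weights are needed. The step-size and momentum schedules are placeholder choices, and all names are hypothetical.

```python
import numpy as np

def stochastic_grad(theta, noise, target):
    """Noisy gradient of F(theta) = -0.5 * ||theta - target||^2 (toy objective)."""
    return -(theta - target) + noise

def normalized_momentum_ascent(theta0, target, n_iters=2000, alpha0=0.5,
                               sigma=0.5, seed=0):
    """Single-loop ascent: one fresh sample per iteration, no batches or checkpoints."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    theta_prev = theta.copy()
    d = np.zeros_like(theta)
    for t in range(n_iters):
        # Placeholder schedules (assumed, not the paper's); eta equals 1 at t = 0.
        eta = (t + 1) ** (-2.0 / 3.0)
        alpha = alpha0 * (t + 1) ** (-2.0 / 3.0)
        noise = sigma * rng.standard_normal(theta.shape)
        g_new = stochastic_grad(theta, noise, target)
        # Same sample evaluated at the previous iterate (recursive correction term).
        g_old = stochastic_grad(theta_prev, noise, target)
        d = g_new + (1.0 - eta) * (d - g_old)
        theta_prev = theta.copy()
        # Normalized step: the iterate moves by (almost exactly) alpha, whatever ||d|| is.
        theta = theta + alpha * d / (np.linalg.norm(d) + 1e-12)
    return theta

target = np.array([1.0, -2.0, 0.5])
theta_final = normalized_momentum_ascent(np.zeros(3), target)
print(np.linalg.norm(theta_final - target))
```

Because each step has length at most `alpha`, consecutive iterates stay close to each other. In the RL setting, where samples depend on the policy, this closeness is what the summary credits for keeping the IS weights bounded without gradient clipping.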

Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space
