Kernel-Based Function Approximation for Average Reward Reinforcement Learning: An Optimistic No-Regret Algorithm

Sattar Vakili, Julia Olkhovskaya
2024-10-31
Abstract: Reinforcement learning utilizing kernel ridge regression to predict the expected value function represents a powerful method with great representational capacity. This setting is a highly versatile framework amenable to analytical results. We consider kernel-based function approximation for RL in the infinite horizon average reward setting, also referred to as the undiscounted setting. We propose an optimistic algorithm, similar to acquisition function based algorithms in the special case of bandits. We establish novel no-regret performance guarantees for our algorithm, under kernel-based modelling assumptions. Additionally, we derive a novel confidence interval for the kernel-based prediction of the expected value function, applicable across various RL problems.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to use kernel ridge regression for function approximation in infinite-horizon average-reward reinforcement learning (RL). The authors focus on the undiscounted (average-reward) setting, which differs from the more commonly studied discounted and episodic settings. It suits tasks that run continuously without episode boundaries, such as load balancing and stock market operations.

### Main Problems

1. **Insufficient Theoretical Understanding**: Compared with the episodic and discounted settings, the theoretical understanding of RL algorithms in the undiscounted setting is relatively limited.
2. **Large-Scale State-Action Spaces**: Many practical problems have very large or potentially infinite state-action spaces, which makes tabular methods difficult to apply.
3. **Non-Linear Function Approximation**: Most existing work focuses on linear models, while kernel methods can handle more complex non-linear function approximation problems.

### Paper Contributions

To address these problems, the paper proposes the first RL algorithm using non-linear function approximation (based on kernel ridge regression) in the infinite-horizon setting, named KUCB-RL (Kernel-based Upper Confidence Bound for Reinforcement Learning). The main contributions are as follows:

1. **No-Regret Guarantees**: The authors establish no-regret performance guarantees for KUCB-RL, the first such guarantees in this setting.
2. **Novel Confidence Intervals**: The authors derive a new kernel-based confidence interval that applies to various RL problems and plays a key role in obtaining the improved results.
3. **Applicable to Different Types of Kernel Functions**: Specific regret bounds are given for very smooth kernels (such as the squared-exponential kernel) and for kernels with polynomial eigenvalue decay (such as the Matérn kernel and the neural tangent (NT) kernel).

### Core Formulas

The key formulas in the paper include:

- Kernel ridge regression predictor and uncertainty estimate:

\[ \hat{f}_t(z) = k_t(z)^\top (K_t + \rho I)^{-1} y_t \]

\[ \sigma^2_t(z) = k(z, z) - k_t(z)^\top (K_t + \rho I)^{-1} k_t(z) \]

where \( k_t(z) = [k(z, z_1), k(z, z_2), \ldots, k(z, z_t)]^\top \), \( K_t = [k(z_i, z_j)]_{i,j = 1}^t \), \( y_t = [y_1, \ldots, y_t]^\top \) collects the observations at \( z_1, \ldots, z_t \), and \(\rho > 0\) is a regularization parameter.

- Confidence interval with width multiplier \(\beta(\delta)\):

\[ |f(z) - \hat{f}_t(z)| \leq \beta(\delta) \sigma_t(z) \]

where \(\beta(\delta)\) depends on the confidence level \(1-\delta\) and on the specific modelling assumptions.

Together, these contributions advance the theoretical understanding of RL in the infinite-horizon average-reward setting and provide a foundation for future research.
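
The predictor, the uncertainty estimate and the confidence-interval formula from the Core Formulas above can be illustrated with a minimal NumPy sketch. It assumes a squared-exponential kernel with a hypothetical `lengthscale` parameter, and it treats `beta` as a constant supplied by the user, whereas the paper derives \(\beta(\delta)\) from its own assumptions. The function names (`rbf_kernel`, `krr_predict`, `upper_confidence_bound`) and the toy data are illustrative only; this is not the KUCB-RL algorithm itself, just the regression step that an optimistic algorithm of this kind would build on.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))


def krr_predict(Z, y, z_query, rho=0.1, lengthscale=1.0):
    """Return (f_hat, sigma) at each query point.

    f_hat(z)   = k_t(z)^T (K_t + rho I)^{-1} y_t
    sigma^2(z) = k(z, z) - k_t(z)^T (K_t + rho I)^{-1} k_t(z)
    """
    K = rbf_kernel(Z, Z, lengthscale)            # K_t, shape (t, t)
    k_q = rbf_kernel(Z, z_query, lengthscale)    # column j is k_t(z_query[j])
    A = K + rho * np.eye(len(Z))
    f_hat = k_q.T @ np.linalg.solve(A, y)
    V = np.linalg.solve(A, k_q)                  # (K_t + rho I)^{-1} k_t(z)
    # k(z, z) = 1 for the squared-exponential kernel assumed here.
    var = 1.0 - np.sum(k_q * V, axis=0)
    return f_hat, np.sqrt(np.maximum(var, 0.0))


def upper_confidence_bound(Z, y, z_query, beta, rho=0.1, lengthscale=1.0):
    """Optimistic estimate f_hat(z) + beta * sigma(z); beta is an assumed constant."""
    f_hat, sigma = krr_predict(Z, y, z_query, rho, lengthscale)
    return f_hat + beta * sigma


# Toy usage: noisy observations of a smooth function on [-1, 1]^2.
rng = np.random.default_rng(0)
Z = rng.uniform(-1.0, 1.0, size=(30, 2))
y = np.sin(3.0 * Z[:, 0]) + 0.05 * rng.standard_normal(30)
z_query = np.array([[0.0, 0.0], [5.0, 5.0]])     # one point near the data, one far away
f_hat, sigma = krr_predict(Z, y, z_query)
ucb = upper_confidence_bound(Z, y, z_query, beta=2.0)
```

The far-away query point receives an uncertainty close to the prior value \(\sqrt{k(z, z)} = 1\) and a prediction close to zero, while the point inside the data has a smaller \(\sigma_t(z)\). Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is a common choice for numerical stability.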