Abstract: Reinforcement learning utilizing kernel ridge regression to predict the expected value function represents a powerful method with great representational capacity. This setting is a highly versatile framework amenable to analytical results. We consider kernel-based function approximation for RL in the infinite horizon average reward setting, also referred to as the undiscounted setting. We propose an optimistic algorithm, similar to acquisition function based algorithms in the special case of bandits. We establish novel no-regret performance guarantees for our algorithm, under kernel-based modelling assumptions. Additionally, we derive a novel confidence interval for the kernel-based prediction of the expected value function, applicable across various RL problems.

What problem does this paper attempt to address?

This paper addresses the problem of using kernel ridge regression for function approximation in the infinite-horizon average-reward reinforcement learning (RL) setting. Specifically, the authors focus on RL in the undiscounted (average-reward) setting, which differs from the more commonly studied discounted and episodic settings. This setting suits tasks that run continuously without episodes, such as load balancing and stock market operations.

### Main Problems

1. **Insufficient Theoretical Understanding**: Compared with other settings (such as the episodic and discounted settings), the theoretical understanding of reinforcement learning algorithms in the undiscounted setting is relatively limited.
2. **Large-Scale State-Action Spaces**: Many practical problems have very large or potentially infinite state-action spaces, making tabular methods difficult to apply.
3. **Non-Linear Function Approximation**: Most existing work focuses on linear models, while kernel methods can handle more complex non-linear function approximation problems.

### Paper Contributions

To address these problems, the paper proposes KUCB-RL (Kernel-based Upper Confidence Bound for Reinforcement Learning), presented as the first reinforcement learning algorithm using non-linear function approximation (based on kernel ridge regression) in the infinite-horizon average-reward setting. The main contributions are as follows:

1. **No-Regret Guarantees**: The authors establish no-regret performance guarantees for the proposed KUCB-RL algorithm, achieved for the first time in this setting.
2. **Novel Confidence Intervals**: A new kernel-based confidence interval, applicable to various reinforcement learning problems, is derived; it plays a key role in obtaining the improved results.
3. **Applicable to Different Types of Kernel Functions**: Specific regret bounds are given for very smooth kernels (such as the squared-exponential kernel) and for kernels with polynomial eigenvalue decay (such as the Matérn kernel and the neural tangent (NT) kernel), respectively.

### Core Formulas

The key formulas involved in the paper include:

- Kernel ridge regression predictor and uncertainty estimate:

  \[ \hat{f}_t(z) = k_t(z)^\top (K_t + \rho I)^{-1} y_t \]

  \[ \sigma^2_t(z) = k(z, z) - k_t(z)^\top (K_t + \rho I)^{-1} k_t(z) \]

  where \( k_t(z) = [k(z, z_1), k(z, z_2), \ldots, k(z, z_t)]^\top \), \( K_t = [k(z_i, z_j)]_{i,j = 1}^t \), and \(\rho > 0\) is a regularization parameter.

- Confidence interval with width multiplier \(\beta(\delta)\):

  \[ |f(z) - \hat{f}_t(z)| \leq \beta(\delta) \sigma_t(z) \]

  where \(\beta(\delta)\) depends on the confidence level \(1-\delta\) and on the specific assumptions.

Through these contributions, the paper advances the understanding of reinforcement learning in the infinite-horizon average-reward setting and provides a foundation for future research.
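The kernel ridge regression predictor, its uncertainty estimate, and an optimistic upper confidence bound built from them can be sketched in NumPy as follows. This is a minimal illustration of the formulas only, not the KUCB-RL algorithm itself. The function names (`se_kernel`, `krr_predict`, `ucb`), the choice of a squared-exponential kernel with a `lengthscale` parameter, and the values of `rho` and `beta` are all assumptions made for the example; in the paper, \(\beta(\delta)\) is derived from the analysis rather than picked by hand.

```python
import numpy as np


def se_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between rows of a (n, d) and b (m, d)."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)


def krr_predict(z_train, y_train, z_query, rho=0.1, lengthscale=1.0):
    """Return (f_hat, sigma) at each query point.

    f_hat(z)   = k_t(z)^T (K_t + rho I)^{-1} y_t
    sigma^2(z) = k(z, z) - k_t(z)^T (K_t + rho I)^{-1} k_t(z)
    """
    K_t = se_kernel(z_train, z_train, lengthscale)   # shape (t, t)
    k_t = se_kernel(z_train, z_query, lengthscale)   # shape (t, m)
    A = K_t + rho * np.eye(z_train.shape[0])

    # Solve linear systems rather than forming the inverse explicitly.
    alpha = np.linalg.solve(A, y_train)              # (K_t + rho I)^{-1} y_t
    v = np.linalg.solve(A, k_t)                      # (K_t + rho I)^{-1} k_t(z)

    f_hat = k_t.T @ alpha
    k_zz = np.ones(z_query.shape[0])                 # k(z, z) = 1 for this kernel
    var = k_zz - np.sum(k_t * v, axis=0)
    sigma = np.sqrt(np.clip(var, 0.0, None))         # guard against round-off
    return f_hat, sigma


def ucb(f_hat, sigma, beta):
    """Optimistic estimate: prediction plus beta times the uncertainty."""
    return f_hat + beta * sigma


# Illustrative data (hypothetical): 20 noisy observations in 2 dimensions.
rng = np.random.default_rng(0)
z_train = rng.uniform(-1.0, 1.0, size=(20, 2))
y_train = np.sin(3.0 * z_train[:, 0]) + 0.05 * rng.standard_normal(20)
z_query = rng.uniform(-1.0, 1.0, size=(5, 2))

f_hat, sigma = krr_predict(z_train, y_train, z_query)
scores = ucb(f_hat, sigma, beta=2.0)                 # beta chosen by hand here
best_query = z_query[np.argmax(scores)]
```

Picking the query point with the largest score mirrors the acquisition-function view from the bandit setting mentioned in the abstract: points are favored either because the prediction is high or because the uncertainty is large.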

Kernel-Based Function Approximation for Average Reward Reinforcement Learning: An Optimistic No-Regret Algorithm

Open Problem: Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning

Gradient Q : A Unified Algorithm with Function Approximation for Reinforcement Learning

Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

Provably Efficient Reinforcement Learning with Linear Function Approximation

Provably Efficient Reinforcement Learning Via Surprise Bound

Optimism in Reinforcement Learning with Generalized Linear Function Approximation

Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation

Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Optimistic Q-learning for average reward and episodic reinforcement learning

Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation

Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation and Human Feedback

Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning

Kernel-Based Decentralized Policy Evaluation for Reinforcement Learning

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback

Generalized linear function approximation