Abstract: To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed UCRL-WVTR, whose regret bound is both \emph{horizon-free}, in that it eliminates the polynomial dependency on the planning horizon, and \emph{instance-dependent}. The derived regret bound is \emph{sharp}, as it matches the minimax lower bound up to logarithmic factors when specialized to linear mixture MDPs. Furthermore, UCRL-WVTR is \emph{computationally efficient} given access to a regression oracle. Achieving such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analyses: a novel concentration bound for weighted non-linear least squares and a refined analysis that leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.

What problem does this paper attempt to address?

The question this paper addresses is: in reinforcement learning (RL) with general function approximation, can learning be computationally efficient while achieving horizon-free and instance-dependent regret?

### Specific problem description:

1. **Horizon-Free Regret**:
   - In reinforcement learning problems with long planning horizons, the regret of existing algorithms usually depends polynomially on the planning horizon \(H\), which makes them less efficient on long-term planning problems.
   - The paper proposes a new algorithm, UCRL-WVTR, whose regret bound has no polynomial dependence on the planning horizon, improving efficiency on long-term planning problems.
2. **Instance-Dependent Regret**:
   - Traditional worst-case regret bounds are often too loose and do not reflect the complexity of the specific problem at hand.
   - Instance-dependent regret bounds give tighter guarantees that adapt to the characteristics of the specific problem, so they are more useful in practice.
3. **General Function Approximation**:
   - Existing research mainly focuses on tabular or linear mixture Markov decision processes (MDPs), and these assumptions are often too restrictive for real-world problems.
   - The goal of the paper is to achieve the two properties above (horizon-free and instance-dependent) under a more general function approximation framework that covers a wider range of practical problems.

### Solutions:

To achieve these goals, the paper proposes the following key techniques and methods:

1. **Novel algorithm design**:
   - **Weighted Value-Targeted Regression (WVTR)**: Assigning different weights to different data points reduces the sub-optimality caused by heterogeneous noise levels.
   - **High-Order Moment Estimator**: Used to estimate the variance of the next-state value function more accurately.
2. **Fine-grained theoretical analysis**:
   - A novel Bernstein-style concentration inequality for weighted non-linear least squares controls the error of the model estimate.
   - A refined analysis of the high-order expansion for general function classes yields a tight instance-dependent regret bound.
3. **Computational efficiency**:
   - Given access to a regression oracle, the algorithm is computationally efficient.

### Main contributions:

- **Horizon-free and instance-dependent regret bounds are achieved for the first time** for reinforcement learning with general function approximation.
- **Rigorous theoretical proofs and experimental verification are provided**, indicating that the new algorithm works effectively in multiple settings.

In conclusion, by introducing new algorithms and analysis techniques, this paper shows that efficient, horizon-free, and instance-dependent learning is achievable in reinforcement learning with general function approximation.
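The weighting idea behind weighted value-targeted regression can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a scalar linear model and noise variances that are known in advance, whereas UCRL-WVTR works with general function classes and estimates the variances. The function names `weighted_least_squares` and `demo`, and all numerical settings, are hypothetical choices made for this illustration only.

```python
import random


def weighted_least_squares(xs, ys, variances, lam=1e-6):
    """Scalar weighted ridge regression (illustrative assumption, not UCRL-WVTR).

    Minimises sum_i (theta * x_i - y_i)^2 / var_i + lam * theta^2,
    so low-noise samples receive larger weight 1 / var_i.
    """
    num = sum(x * y / v for x, y, v in zip(xs, ys, variances))
    den = lam + sum(x * x / v for x, v in zip(xs, variances))
    return num / den


def demo(seed=0, n=2000, theta_star=0.7):
    """Compare weighted and unweighted estimates under heterogeneous noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Hypothetical noise pattern: alternating low-noise and high-noise samples.
    sigmas = [0.05 if i % 2 == 0 else 5.0 for i in range(n)]
    ys = [theta_star * x + s * rng.gauss(0.0, 1.0) for x, s in zip(xs, sigmas)]

    # Inverse-variance weights (assumes the variances are known).
    theta_weighted = weighted_least_squares(xs, ys, [s * s for s in sigmas])
    # Uniform weights: every sample counts equally.
    theta_uniform = weighted_least_squares(xs, ys, [1.0] * n)

    return abs(theta_weighted - theta_star), abs(theta_uniform - theta_star)


if __name__ == "__main__":
    err_weighted, err_uniform = demo()
    print(f"weighted error:   {err_weighted:.4f}")
    print(f"unweighted error: {err_uniform:.4f}")
```

In this toy setting the weighted estimate is typically much closer to the true parameter, because the high-noise samples are down-weighted instead of dominating the fit. The same motivation applies to regression targets whose variance differs across state-action pairs.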

Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation

Horizon-Free Regret for Linear Markov Decision Processes

Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes

Provably Efficient Reinforcement Learning Via Surprise Bound

Non-stationary Reinforcement Learning under General Function Approximation

Prior-dependent analysis of posterior sampling reinforcement learning with function approximation

Regret Minimization For Reinforcement Learning By Evaluating The Optimal Bias Function

Model-based RL as a Minimalist Approach to Horizon-Free and Second-Order Bounds

Refined Regret for Adversarial MDPs with Linear Function Approximation

A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty

Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

$\sqrt{n}$-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

Nonstationary Reinforcement Learning with Linear Function Approximation