Jiayi Huang, Han Zhong, Liwei Wang, Lin F. Yang
Abstract: To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed UCRL-WVTR, that achieves a regret bound that is both *horizon-free*, in that it eliminates the polynomial dependency on the planning horizon, and *instance-dependent*. The derived regret bound is deemed *sharp*, as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors. Furthermore, UCRL-WVTR is *computationally efficient* with access to a regression oracle. The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analyses: a novel concentration bound of weighted non-linear least squares and a refined analysis which leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.
What problem does this paper attempt to address?
The paper asks whether efficient, horizon-free, and instance-dependent learning can be achieved in reinforcement learning (RL) with general function approximation.
### Specific problem description:
1. **Horizon-Free Regret**:
- In reinforcement learning problems with long planning horizons, the guarantees of existing algorithms usually depend strongly on the planning horizon \(H\), so they degrade on long-term planning problems.
- The paper proposes a new algorithm, UCRL-WVTR, whose regret bound has no polynomial dependence on the planning horizon, which makes it better suited to long-term planning problems.
2. **Instance-Dependent Regret**:
- Traditional worst-case regret bounds are often too loose and cannot accurately reflect the complexity of specific problems.
- Instance-dependent regret bounds can provide tighter guarantees according to the characteristics of specific problems, so they are more practical.
3. **General Function Approximation**:
- Existing research mainly focuses on tabular or linear mixture Markov decision processes (MDPs), and these assumptions often do not hold in real-world problems.
- The goal of the paper is to achieve the above two properties (horizon-free and instance-dependent) under a more general function approximation framework that covers a wider range of practical problems.
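For reference, the two properties above are usually stated in terms of cumulative regret. The following is the standard episodic definition, not a formula quoted from the paper; the symbols \(K\) (number of episodes), \(s_1^k\) (initial state of episode \(k\)), \(\pi^k\) (policy played in episode \(k\)), and \(V_1\) (value function at the first step) are notation assumed here for illustration.

\[
\mathrm{Regret}(K) = \sum_{k=1}^{K} \left( V_1^{*}(s_1^k) - V_1^{\pi^k}(s_1^k) \right)
\]

In this notation, a bound on \(\mathrm{Regret}(K)\) is horizon-free when it has no polynomial dependence on \(H\), and instance-dependent when it scales with a quantity determined by the specific MDP rather than only with worst-case parameters.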
### Solutions:
To achieve these goals, the paper proposes the following key techniques and methods:
1. **Novel algorithm design**:
- **Weighted Value-Targeted Regression (WVTR)**: By assigning different weights to different data points, the sub-optimality caused by heterogeneous noise levels is reduced.
- **High-Order Moment Estimator**: It is used to estimate the variance of the next-state value function more accurately.
2. **Fine-grained theoretical analysis**:
- A novel Bernstein-style concentration inequality for weighted nonlinear least squares is proposed to ensure the accuracy of model estimation.
- A detailed analysis of the high-order expansion of the general function class is carried out, and a tight instance-dependent regret bound is obtained.
3. **Computational efficiency**:
- Given access to a regression oracle, the algorithm is computationally efficient, which makes it feasible in practice.
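The idea behind weighting data points by their noise level can be illustrated with a small sketch. This is not the paper's UCRL-WVTR algorithm: it assumes a linear function class, assumes the per-sample noise variances are known (the paper estimates them), and uses a hypothetical helper named `weighted_ridge`. It only shows why inverse-variance weights help when noise levels are heterogeneous.

```python
import numpy as np

def weighted_ridge(X, y, sigma2, lam=1.0):
    """Hypothetical helper: solve
    argmin_theta sum_i (x_i^T theta - y_i)^2 / sigma2_i + lam * ||theta||^2.
    """
    w = 1.0 / sigma2                      # inverse-variance weights
    Xw = X * w[:, None]                   # scale each row by its weight
    A = Xw.T @ X + lam * np.eye(X.shape[1])
    b = Xw.T @ y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
n, d = 2000, 4
theta_true = np.array([0.5, -0.2, 0.1, 0.3])
X = rng.normal(size=(n, d))

# Heterogeneous noise: about half the samples are nearly clean, the rest very noisy.
sigma = np.where(rng.random(n) < 0.5, 0.05, 2.0)
y = X @ theta_true + sigma * rng.normal(size=n)

# Weighted fit uses the (assumed known) variances; unweighted fit treats all samples alike.
theta_w = weighted_ridge(X, y, sigma**2)
theta_u = weighted_ridge(X, y, np.ones(n))

err_w = np.linalg.norm(theta_w - theta_true)
err_u = np.linalg.norm(theta_u - theta_true)
print(f"weighted error:   {err_w:.4f}")
print(f"unweighted error: {err_u:.4f}")
```

In this toy setting the weighted estimate relies mostly on the low-noise samples, so its error is typically much smaller than the unweighted one. In the paper's setting the variances are not given, which is where the variance estimation described above comes in.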
### Main contributions:
- **Horizon-free and instance-dependent regret bounds are achieved for the first time** for reinforcement learning with general function approximation.
- **Rigorous theoretical proofs and experimental verification are provided**, with experiments that corroborate the theoretical findings.
In conclusion, by introducing a new algorithm and new analysis techniques, the paper addresses the key problem of achieving efficient, horizon-free, and instance-dependent learning in reinforcement learning with general function approximation.