Abstract: We study the problem of estimating the \emph{value} of the largest mean among $K$ distributions via samples from them (rather than estimating \emph{which} distribution has the largest mean), which arises from various machine learning tasks including Q-learning and Monte Carlo tree search. While there have been a few proposed algorithms, their performance analyses have been limited to their biases rather than a precise error metric. In this paper, we propose a novel algorithm called HAVER (Head AVERaging) and analyze its mean squared error. Our analysis reveals that HAVER has a compelling performance in two respects. First, HAVER estimates the maximum mean as well as the oracle who knows the identity of the best distribution and reports its sample mean. Second, perhaps surprisingly, HAVER exhibits even better rates than this oracle when there are many distributions near the best one. Both of these improvements are the first of their kind in the literature, and we also prove that the naive algorithm that reports the largest empirical mean does not achieve these bounds. Finally, we confirm our theoretical findings via numerical experiments including bandits and Q-learning scenarios where HAVER outperforms baseline methods.
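The positive bias of the naive estimator that reports the largest empirical mean, and its gap to the oracle, can be seen in a small simulation. The sketch below is an illustration under assumptions that are not from the paper: unit-variance Gaussian distributions, equal sample counts per distribution, and hypothetical names (`simulate_lem_vs_oracle`, `n_trials`).

```python
import numpy as np

def simulate_lem_vs_oracle(means, n_samples, n_trials=20000, seed=0):
    """Monte Carlo bias and MSE of the largest empirical mean (LEM) and the oracle.

    Assumption (not from the paper): each distribution is N(mean, 1) and
    every distribution receives the same number of samples.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    best = int(np.argmax(means))   # index the oracle is told in advance
    mu_max = means[best]           # the quantity being estimated

    # The sample mean of n_samples draws from N(mu, 1) is N(mu, 1 / n_samples),
    # so empirical means are drawn directly instead of raw samples.
    emp = rng.normal(means, 1.0 / np.sqrt(n_samples), size=(n_trials, means.size))

    lem = emp.max(axis=1)          # naive estimate: largest empirical mean
    oracle = emp[:, best]          # oracle estimate: sample mean of the best arm

    return {
        "lem_bias": float(np.mean(lem - mu_max)),
        "lem_mse": float(np.mean((lem - mu_max) ** 2)),
        "oracle_bias": float(np.mean(oracle - mu_max)),
        "oracle_mse": float(np.mean((oracle - mu_max) ** 2)),
    }

# Twenty distributions with identical means and ten samples each.
result = simulate_lem_vs_oracle(means=np.zeros(20), n_samples=10)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With all means equal, every distribution is a candidate for the maximum, so the naive estimate is expected to overshoot the true value, while the oracle stays unbiased with an MSE close to $1/N$ for $N$ samples.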

What problem does this paper attempt to address?

The problem this paper addresses is how to estimate the value of the largest mean among several distributions using samples drawn from them. The researchers focus on estimating this maximum mean accurately in machine learning tasks such as Q-learning and Monte Carlo tree search.

### Problem Background

Many machine learning tasks, such as Q-learning and Monte Carlo tree search, require estimating the largest mean among a set of distributions. Take Q-learning as an example. At each time step, the agent updates its state-action value estimate $\hat{Q}(s, a)$ based on the observed reward and the value of the next state. This requires an accurate estimate of the next state's maximum state-action value $\max_a Q^*(s', a)$. An inaccurate estimate can harm the learning process.

### Limitations of Existing Methods

The simplest method is to report the Largest Empirical Mean (LEM), but this leads to a positive bias. The bias hurts accuracy most when the number of samples is small or the number of distributions is large. Although previous studies have proposed improved methods, their performance analyses are limited to the direction of the bias or to the variance, and do not provide a precise error metric such as the Mean Squared Error (MSE).

### Main Contributions of the Paper

To address these problems, the paper proposes a new algorithm named HAVER (Head AVERaging) and analyzes its Mean Squared Error. The analysis shows that HAVER has the following advantages:

1. **Performance Comparable to Oracle**: HAVER estimates the maximum mean as well as an oracle that knows the identity of the best distribution and reports its sample mean.
2. **Performance Beyond Oracle**: When many distributions are close to the best one, HAVER achieves a better rate than the oracle.
3. **Outperforming Baseline Methods**: In numerical experiments (including multi-armed bandit and Q-learning scenarios), HAVER performs better than existing baseline methods.

### Formula Representation

The key formulas involved in the paper are the following:

- The goal of the maximum mean estimation problem is to estimate: \[ \max_{i \in [K]} \mathbb{E}_{X \sim \nu_i}[X] \]
- The Mean Squared Error (MSE) of an estimate $\hat{\mu}$, where $\mu_1$ denotes the largest mean, is defined as: \[ \text{MSE}(\hat{\mu})=\mathbb{E}\left[(\hat{\mu}-\mu_1)^2\right] \]
- For the HAVER algorithm, the upper bound on its Mean Squared Error can be expressed, in the paper's notation, as: \[ \text{MSE}(\hat{\mu}_{\text{HAVER}})=\tilde{O}\left(\left(\frac{\max_{r \in R} \sum_{i \in B^+(r)} N_i \Delta_i}{\sum_{j \in B^*(r)} N_j}\right)^2 \wedge \frac{1}{N_1}\right)+\cdots \]

### Conclusion

In summary, this paper aims to estimate the maximum mean more accurately from samples of multiple distributions. By introducing the HAVER algorithm, the authors provide theoretical improvements and also demonstrate better performance in numerical experiments.
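To illustrate why averaging several near-best distributions can beat the oracle, here is a minimal sketch of a head-averaging-style estimator. It is not the paper's HAVER specification: the selection rule, the confidence width `width_scale / sqrt(N_i)`, the Gaussian setup, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def head_average(emp_means, counts, width_scale=3.0):
    """Hypothetical head-averaging rule (an illustration, not the paper's HAVER).

    Picks the arm with the largest lower confidence bound as a pivot, keeps
    every arm whose empirical mean reaches that bound, and returns the
    sample-count-weighted average of the kept arms.
    """
    emp_means = np.asarray(emp_means, dtype=float)
    counts = np.asarray(counts, dtype=float)
    widths = width_scale / np.sqrt(counts)   # assumed confidence width per arm
    lcb = emp_means - widths                 # lower confidence bounds
    pivot = int(np.argmax(lcb))
    head = emp_means >= lcb[pivot]           # always contains the pivot itself
    return float(np.sum(counts[head] * emp_means[head]) / np.sum(counts[head]))

def mse_head_vs_oracle(means, n_samples, n_trials=5000, seed=0):
    """Monte Carlo MSE of head_average and the oracle for N(mean, 1) arms."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    best = int(np.argmax(means))
    mu_max = means[best]
    counts = np.full(means.size, n_samples)
    # Empirical means of n_samples unit-variance draws are N(mu, 1 / n_samples).
    emp = rng.normal(means, 1.0 / np.sqrt(n_samples), size=(n_trials, means.size))
    head_est = np.array([head_average(row, counts) for row in emp])
    return {
        "head_mse": float(np.mean((head_est - mu_max) ** 2)),
        "oracle_mse": float(np.mean((emp[:, best] - mu_max) ** 2)),
    }

# Many arms tied with the best one: pooling their samples reduces variance.
demo = mse_head_vs_oracle(np.zeros(20), n_samples=10)
print(demo)
```

In this equal-means instance the pooled estimate draws on many more samples than the single best arm, which is the intuition behind the "better than the oracle" claim. When one arm is clearly separated, the same rule keeps only that arm and reduces to the oracle's sample mean. The width constant is a tuning assumption here; the paper's actual thresholds and guarantees should be taken from the paper itself.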

HAVER: Instance-Dependent Error Bounds for Maximum Mean Estimation and Applications to Q-Learning
