Abstract: We present a novel data-driven strategy to choose the hyperparameter $k$ in the $k$-NN regression estimator without using any hold-out data. We treat the problem of choosing the hyperparameter as an iterative procedure (over $k$) and propose a strategy that is easy to implement in practice, based on the idea of early stopping and the minimum discrepancy principle. This model selection strategy is proven to be minimax-optimal over some smoothness function classes, for instance, the class of Lipschitz functions on a bounded domain. The novel method often improves statistical performance on artificial and real-world data sets in comparison to other model selection strategies, such as the Hold-out method, 5-fold cross-validation, and the AIC criterion. The novelty of the strategy comes from reducing the computational time of the model selection procedure while preserving the statistical (minimax) optimality of the resulting estimator. More precisely, given a sample of size $n$, if one should choose $k$ among $\left\{ 1, \ldots, n \right\}$, and $\left\{ f^1, \ldots, f^n \right\}$ are the estimators of the regression function, the minimum discrepancy principle requires computing only a fraction of the estimators, while this is not the case for generalized cross-validation, Akaike's AIC criterion, or the Lepskii principle.
What problem does this paper attempt to address?
The paper addresses how to select the hyperparameter \(k\) in the \(k\)-NN regression estimator without using any hold-out data. Specifically, the authors propose a strategy based on the minimum discrepancy principle (MDP) that treats the choice of \(k\) as an iterative procedure and uses early stopping to reduce computation time while preserving the statistical (minimax) optimality of the resulting estimator.
### Background and Motivation
In non-parametric regression, the theoretical performance of the \(k\)-NN regression estimator has been widely studied since the 1970s. However, selecting an appropriate value of \(k\) remains a challenge. Common methods include cross-validation (such as 5-fold cross-validation) and the AIC criterion, but these methods usually need to compute the estimators for all candidate values of \(k\), which is computationally expensive, especially for large data sets.
### Proposed Method
The paper proposes a new data-driven strategy that uses the minimum discrepancy principle to select the value of \(k\). The main features of this method are:
- **No need for hold-out data**: Hold-out and cross-validation methods set aside part of the data for validation, whereas the proposed method selects \(k\) using the training data alone.
- **Early stopping**: The choice of \(k\) is treated as an iteration over \(k\). The empirical risk is monitored along the way, and the iteration stops once it indicates that the estimator has started to fit the noise, which avoids over-fitting.
- **High computational efficiency**: Unlike other methods (such as generalized cross-validation and the AIC criterion), this method computes only a fraction of the estimators \(f^1, \ldots, f^n\), which reduces the computation time.
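
The early-stopping idea described in the list above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `knn_mdp_select`, the assumption that the noise variance `sigma2` is known, the choice to iterate from \(k = 1\) upward, and the rule "stop at the first \(k\) whose empirical risk reaches `sigma2`" are all assumptions made for the example, since the summary does not spell out the exact stopping rule.

```python
import numpy as np


def knn_mdp_select(X, y, sigma2, k_max=None):
    """Hypothetical helper: choose k for k-NN regression by early stopping.

    Assumptions (not taken from the paper): sigma2 is a known noise
    variance, k grows from 1, and we stop at the first k whose empirical
    risk on the training data reaches sigma2.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    k_max = n if k_max is None else min(k_max, n)

    # Pairwise squared Euclidean distances between training points.
    diffs = X[:, None, :] - X[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diffs, diffs)
    # Row i holds the training indices sorted by distance to x_i.
    order = np.argsort(dist2, axis=1, kind="stable")

    running_sum = np.zeros(n)
    fitted = np.zeros(n)
    for k in range(1, k_max + 1):
        # Update the k-NN averages incrementally from the (k-1)-NN ones.
        running_sum += y[order[:, k - 1]]
        fitted = running_sum / k
        risk = np.mean((y - fitted) ** 2)  # empirical risk of f^k
        if risk >= sigma2:
            # Stop early: estimators for larger k are never computed.
            return k, fitted
    return k_max, fitted


# Usage on synthetic data (the regression function and noise level are made up).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
sigma = 0.3
y = np.sin(2 * np.pi * X[:, 0]) + sigma * rng.normal(size=200)
k_hat, fitted = knn_mdp_select(X, y, sigma2=sigma**2)
print("selected k:", k_hat)
```

The sketch stops as soon as the stopping condition is met, so only the estimators up to the selected \(k\) are evaluated. In practice the noise variance would usually have to be estimated, which this example does not attempt.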
### Theoretical Results
The authors prove that the proposed minimum discrepancy principle strategy is minimax-optimal over some smoothness function classes, for instance the class of Lipschitz functions on a bounded domain. On the computational side, for a sample of size \(n\) with \(k\) selected from \(\{1,\ldots,n\}\), the minimum discrepancy principle requires computing only a fraction of the estimators \(f^1, \ldots, f^n\), whereas this is not the case for generalized cross-validation, the AIC criterion, or the Lepskii principle.
### Experimental Results
The experiments show that the proposed method often improves statistical performance over other model selection strategies, such as the Hold-out method, 5-fold cross-validation, and the AIC criterion, on both artificial and real-world data sets. The method also reduces the computation time of the model selection procedure.
### Key Contributions
1. **New strategy**: A data-driven strategy based on the minimum discrepancy principle and early stopping is proposed for selecting the hyperparameter \(k\) in \(k\)-NN regression.
2. **Theoretical guarantee**: The strategy is proven to be minimax-optimal over some smoothness function classes.
3. **Computational efficiency**: Because only a fraction of the estimators is computed, the method reduces the computation time of model selection.
4. **Practical application**: Experiments on artificial and real-world data sets show that the method often improves on other model selection strategies.
### Conclusion
The paper proposes a new data-driven strategy for selecting the hyperparameter \(k\) in \(k\)-NN regression and supports it both theoretically and experimentally. The method preserves the minimax optimality of the resulting estimator while reducing the computational cost of model selection.