Marcos Matabuena, Juan C. Vidal, Oscar Hernan Madrid Padilla, Jukka-Pekka Onnela
Abstract: In this paper, we introduce a kNN-based regression method that synergizes the scalability and adaptability of traditional non-parametric kNN models with a novel variable selection technique. This method focuses on accurately estimating the conditional mean and variance of random response variables, thereby effectively characterizing conditional distributions across diverse scenarios. Our approach incorporates a robust uncertainty quantification mechanism, leveraging our prior estimation work on conditional mean and variance. The employment of kNN ensures scalable computational efficiency in predicting intervals and statistical accuracy in line with optimal non-parametric rates. Additionally, we introduce a new kNN semi-parametric algorithm for estimating ROC curves, accounting for covariates. For selecting the smoothing parameter k, we propose an algorithm with theoretical guarantees. Incorporation of variable selection enhances the performance of the method significantly over conventional kNN techniques in various modeling tasks. We validate the approach through simulations in low, moderate, and high-dimensional covariate spaces. The algorithm's effectiveness is particularly notable in biomedical applications as demonstrated in two case studies. Concluding with a theoretical analysis, we highlight the consistency and convergence rate of our method over traditional kNN models, particularly when the underlying regression model takes values in a low-dimensional space.
What problem does this paper attempt to address?
This paper introduces a regression method based on the k-nearest neighbors (kNN) algorithm that combines the scalability and adaptability of the traditional nonparametric kNN model with a new variable selection technique. The method focuses on estimating the conditional mean and variance of a random response variable, and in doing so characterizes the conditional distribution across different scenarios. It incorporates an uncertainty quantification mechanism that builds on the authors' earlier work on conditional mean and variance estimation. Using kNN keeps the computation of prediction intervals scalable and yields statistical accuracy in line with optimal nonparametric rates, particularly when the underlying regression model takes values in a low-dimensional space.
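To make the idea of kNN estimation of a conditional mean and variance concrete, here is a minimal sketch in plain Python. It is not the authors' estimator: the function names, the simulated data, the choice k = 30, and the use of squared residuals averaged over neighbors are all illustrative assumptions, and no variable selection is performed.

```python
import math
import random
import statistics


def knn_indices(X, x0, k):
    """Indices of the k training points closest to x0 (Euclidean distance)."""
    return sorted(range(len(X)), key=lambda i: math.dist(X[i], x0))[:k]


def knn_mean(X, y, x0, k):
    """Estimate E[Y | X = x0] as the average response of the k neighbors."""
    return statistics.fmean([y[i] for i in knn_indices(X, x0, k)])


def knn_variance(X, y, x0, k):
    """Estimate Var[Y | X = x0] as the average squared residual of the
    k neighbors, each residual taken around the kNN mean at that neighbor."""
    return statistics.fmean(
        [(y[i] - knn_mean(X, y, X[i], k)) ** 2 for i in knn_indices(X, x0, k)]
    )


# Simulated heteroscedastic data (an illustrative assumption):
# Y = sin(2*pi*X) + (0.1 + 0.5*X) * noise, with X uniform on [0, 1].
rng = random.Random(0)
X = [(rng.random(),) for _ in range(400)]
y = [math.sin(2 * math.pi * x[0]) + (0.1 + 0.5 * x[0]) * rng.gauss(0, 1) for x in X]

k = 30  # hypothetical choice; the paper proposes a data-driven rule instead
mean_at_quarter = knn_mean(X, y, (0.25,), k)  # true conditional mean is 1.0
var_low = knn_variance(X, y, (0.1,), k)       # true variance is 0.15 ** 2
var_high = knn_variance(X, y, (0.9,), k)      # true variance is 0.55 ** 2
print(mean_at_quarter, var_low, var_high)
```

The brute-force neighbor search sorts all training points for every query, which is fine for a small illustration; a tree-based or approximate neighbor index would be the natural replacement at scale.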
The paper proposes a new kNN semiparametric algorithm for estimating receiver operating characteristic (ROC) curves that accounts for covariates. An algorithm with theoretical guarantees is proposed for selecting the smoothing parameter k. Incorporating variable selection significantly improves performance over conventional kNN techniques across various modeling tasks. The method is validated through simulations in low-, moderate-, and high-dimensional covariate spaces, and its effectiveness is demonstrated in two biomedical case studies, suggesting potential for large-scale medical research.
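One common way to build a covariate-specific ROC curve from mean and variance estimates is the induced location-scale formulation, in which the biomarker in each group g (healthy H, diseased D) is modeled as Y_g = mu_g(x) + sigma_g(x) * eps_g, and ROC_x(p) = 1 - F_D((mu_H(x) - mu_D(x) + sigma_H(x) * F_H^{-1}(1 - p)) / sigma_D(x)), with F_g the distribution of the standardized residuals. The sketch below plugs simple kNN estimates into that formula. Whether the paper uses exactly this construction is not stated in this summary, so treat the formula, the function names, the local standard deviation estimate, and the simulated data as assumptions.

```python
import bisect
import math
import random
import statistics


def knn_mean_sd(X, y, x0, k):
    """Mean and standard deviation of the responses of the k nearest
    neighbors of x0 (scalar covariate, absolute distance)."""
    idx = sorted(range(len(X)), key=lambda i: abs(X[i] - x0))[:k]
    vals = [y[i] for i in idx]
    return statistics.fmean(vals), statistics.pstdev(vals)


def standardized_residuals(X, y, k):
    """Sorted residuals (y_i - mu(x_i)) / sigma(x_i) for one group."""
    res = []
    for xi, yi in zip(X, y):
        m, s = knn_mean_sd(X, y, xi, k)
        res.append((yi - m) / s)
    return sorted(res)


def ecdf(sorted_vals, t):
    """Empirical CDF of sorted_vals evaluated at t."""
    return bisect.bisect_right(sorted_vals, t) / len(sorted_vals)


def empirical_quantile(sorted_vals, q):
    """Empirical quantile (generalized inverse of the ECDF) at level q."""
    n = len(sorted_vals)
    return sorted_vals[min(n - 1, max(0, math.ceil(q * n) - 1))]


def covariate_roc(Xh, yh, Xd, yd, x0, k, fprs):
    """True positive rate at each false positive rate in fprs, given X = x0."""
    mh, sh = knn_mean_sd(Xh, yh, x0, k)
    md, sd = knn_mean_sd(Xd, yd, x0, k)
    eh = standardized_residuals(Xh, yh, k)
    ed = standardized_residuals(Xd, yd, k)
    return [
        1 - ecdf(ed, (mh - md + sh * empirical_quantile(eh, 1 - p)) / sd)
        for p in fprs
    ]


# Simulated biomarker (illustrative): diseased values are shifted up by 2.
rng = random.Random(1)
Xh = [rng.random() for _ in range(300)]
yh = [x + rng.gauss(0, 1) for x in Xh]
Xd = [rng.random() for _ in range(300)]
yd = [x + 2 + rng.gauss(0, 1) for x in Xd]

fprs = [0.1, 0.3, 0.5, 0.9]
roc = covariate_roc(Xh, yh, Xd, yd, x0=0.5, k=40, fprs=fprs)
print(list(zip(fprs, roc)))
```

The approach is semiparametric in the sense that the mean and scale functions are estimated nonparametrically while the location-scale structure itself is assumed.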
The main contributions of the paper are as follows:
1. Providing an efficient kNN prediction framework that improves the performance of kNN regression methods in estimating the mean and variance functions.
2. Introducing innovative variable selection strategies to enhance the interpretability and convergence speed of the model.
3. Proposing data-driven rules for selecting the k parameter, enhancing the accuracy of mean and variance estimation.
4. Effectively recovering the conditional distribution under a range of location-scale models, surpassing traditional nonparametric local conditional distribution methods.
5. Proposing a new method for estimating prediction intervals that is more computationally efficient than other nonparametric methods.
6. Estimating ROC curves in the presence of covariates, suitable for validating biomarkers.
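As an illustration of how mean and variance estimates can be turned into prediction intervals, the sketch below uses a location-scale recipe: fit kNN mean and scale on one split, compute standardized residuals on a separate calibration split, and rescale their empirical quantiles at a new point. The data split, the function names, the local standard deviation estimate, and the fixed k are assumptions made for this example and are not taken from the paper.

```python
import math
import random
import statistics


def knn_mean_sd(X, y, x0, k):
    """Mean and standard deviation of the responses of the k nearest
    neighbors of x0 (scalar covariate, absolute distance)."""
    idx = sorted(range(len(X)), key=lambda i: abs(X[i] - x0))[:k]
    vals = [y[i] for i in idx]
    return statistics.fmean(vals), statistics.pstdev(vals)


def empirical_quantile(sorted_vals, q):
    """Empirical quantile (generalized inverse of the ECDF) at level q."""
    n = len(sorted_vals)
    return sorted_vals[min(n - 1, max(0, math.ceil(q * n) - 1))]


def residual_quantiles(X_fit, y_fit, X_cal, y_cal, k, alpha):
    """Lower and upper quantiles of the standardized calibration residuals."""
    res = []
    for xc, yc in zip(X_cal, y_cal):
        m, s = knn_mean_sd(X_fit, y_fit, xc, k)
        res.append((yc - m) / s)
    res.sort()
    return empirical_quantile(res, alpha / 2), empirical_quantile(res, 1 - alpha / 2)


def prediction_interval(X_fit, y_fit, x0, k, q_lo, q_hi):
    """Interval [mu(x0) + sigma(x0) * q_lo, mu(x0) + sigma(x0) * q_hi]."""
    m, s = knn_mean_sd(X_fit, y_fit, x0, k)
    return m + s * q_lo, m + s * q_hi


def simulate(rng, n):
    """Heteroscedastic toy data: Y = sin(2*pi*X) + (0.1 + 0.5*X) * noise."""
    X = [rng.random() for _ in range(n)]
    y = [math.sin(2 * math.pi * x) + (0.1 + 0.5 * x) * rng.gauss(0, 1) for x in X]
    return X, y


rng = random.Random(2)
X_fit, y_fit = simulate(rng, 400)
X_cal, y_cal = simulate(rng, 200)
X_test, y_test = simulate(rng, 200)

k, alpha = 25, 0.1  # hypothetical tuning choices, nominal coverage 90%
q_lo, q_hi = residual_quantiles(X_fit, y_fit, X_cal, y_cal, k, alpha)

# Fraction of held-out responses that fall inside their intervals.
hits = 0
for xt, yt in zip(X_test, y_test):
    lo, hi = prediction_interval(X_fit, y_fit, xt, k, q_lo, q_hi)
    hits += lo <= yt <= hi
coverage = hits / len(X_test)
print(q_lo, q_hi, coverage)
```

Because only two residual quantiles are stored, producing an interval at a new point costs one neighbor search, which is the kind of computational advantage item 5 refers to.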
The paper is well structured and covers mathematical models, theoretical analysis, simulation analysis, and practical application cases. Its methods extend the applications of the kNN model, particularly in handling high-dimensional data and biomedical research, and provide a nonparametric alternative to traditional parametric models.