On high-dimensional modifications of the nearest neighbor classifier

Annesha Ghosh, Deep Ghoshal, Bilol Banerjee, Anil K. Ghosh
2024-10-24
Abstract: Nearest neighbor classifier is arguably the most simple and popular nonparametric classifier available in the literature. However, due to the concentration of pairwise distances and the violation of the neighborhood structure, this classifier often suffers in high-dimension, low-sample size (HDLSS) situations, especially when the scale difference between the competing classes dominates their location difference. Several attempts have been made in the literature to take care of this problem. In this article, we discuss some of these existing methods and propose some new ones. We carry out some theoretical investigations in this regard and analyze several simulated and benchmark datasets to compare the empirical performances of proposed methods with some of the existing ones.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the performance degradation of the nearest neighbor classifier in high-dimension, low-sample-size (HDLSS) settings, especially when the scale difference between the competing classes dominates their location difference. Specifically:

1. **Concentration of distances in high-dimensional data**: In high-dimensional space, pairwise distances between points tend to concentrate, which destroys the neighborhood structure that the nearest neighbor classifier relies on. For example, even if the centers of two class distributions differ significantly, a large enough scale difference can prevent the nearest neighbor classifier from separating the classes.
2. **Limitations of existing methods**: Several modifications have been proposed in the literature, but they remain deficient in specific situations, notably scale problems. For example, the scale-adjusted classifier proposed by Chan and Hall (the CH classifier) performs well on location-scale problems, but it still performs poorly when the classes differ only in scale.
3. **Proposing new solutions**: The paper proposes a modified scale-adjusted nearest neighbor classifier (the Modified Chan and Hall classifier, or MCH classifier) and reports superior performance on high-dimensional data in multiple experiments. The paper also explores a classifier based on minimum-distance features (the MDist classifier), which performs well in complex situations such as mixture distributions.

With these methods, the paper aims to improve the classification performance of the nearest neighbor classifier in HDLSS settings, especially when there are significant scale differences between the classes.
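The failure mode described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's experimental setup: two Gaussian classes share the same mean and differ only in scale (standard deviations 1 and 2, chosen here for illustration), and the function names `pairwise_sq_dists`, `one_nn_predict`, and `simulate` are hypothetical. It reports the 1-nearest-neighbor test error and the coefficient of variation of within-class pairwise distances, a simple measure of how strongly the distances concentrate.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    return (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T


def one_nn_predict(train_x, train_y, test_x):
    """Label each test point with the class of its nearest training point."""
    d2 = pairwise_sq_dists(test_x, train_x)
    return train_y[d2.argmin(axis=1)]


def simulate(dim, n_train=20, n_test=100, sigma0=1.0, sigma1=2.0, seed=0):
    """Scale-only problem: both classes are N(0, sigma^2 I), with different sigma.

    Returns (test error of 1-NN, coefficient of variation of the pairwise
    distances among the class-0 training points).
    """
    rng = np.random.default_rng(seed)

    def sample(n, sigma):
        return rng.normal(0.0, sigma, size=(n, dim))

    train_x = np.vstack([sample(n_train, sigma0), sample(n_train, sigma1)])
    train_y = np.repeat([0, 1], n_train)
    test_x = np.vstack([sample(n_test, sigma0), sample(n_test, sigma1)])
    test_y = np.repeat([0, 1], n_test)

    pred = one_nn_predict(train_x, train_y, test_x)
    error = float(np.mean(pred != test_y))

    # Distance concentration: relative spread of within-class distances.
    x0 = train_x[:n_train]
    upper = np.triu_indices(n_train, k=1)
    dists = np.sqrt(np.clip(pairwise_sq_dists(x0, x0)[upper], 0.0, None))
    cv = float(dists.std() / dists.mean())
    return error, cv


if __name__ == "__main__":
    for dim in (2, 2000):
        error, cv = simulate(dim)
        print(f"dim={dim}: 1-NN test error={error:.3f}, distance CV={cv:.3f}")
```

In this sketch the relative spread of the distances shrinks sharply as the dimension grows, and the high-dimensional error should come out close to 0.5: a point from the larger-scale class is typically closer to training points of the smaller-scale class than to points of its own class, so nearly every test point is assigned to the smaller-scale class.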