Evidential instance selection for K-nearest neighbor classification of big data
Chaoyu Gong, Zhi-gang Su, Pei-hong Wang, Qian Wang, Yang You
DOI: https://doi.org/10.1016/j.ijar.2021.08.006
IF: 4.452
2021-11-01
International Journal of Approximate Reasoning
Abstract: Many instance selection algorithms have been introduced to reduce the high storage requirements and computational complexity of K-nearest neighbor (K-NN) classification rules. However, many studies do not fully exploit the information provided by the neighbors of an instance. This information usually takes the form of a quantitative metric for determining whether an instance can be selected. As a result, many instances may receive the same quality score, which makes the selection results ambiguous. In addition, the proposed metrics are simply added together without deeper fusion, and the resulting information loss has further negative effects. To address these issues, we propose a new instance selection algorithm for K-NN rules in the evidence theory framework, called evidential instance selection (EIS). The basic idea is that all neighbors of every instance first provide distinct items of evidence regarding the estimated value of the label (called the estimation label) for that instance. After fusing the items of evidence and computing the conflicts among them, instances with higher conflict are considered more likely to be near the class boundaries. Finally, the selection of boundary instances is formalized as an optimization problem, where the objective function considers both the reduction rate and the classification accuracy. For big data sets, EIS is extended to a distributed and parallel version called EIS-AS, which uses Apache Spark to alleviate the computational bottleneck. We tested EIS on 30 small data sets and EIS-AS on six big data sets containing up to 11 million instances. The experimental results showed that EIS performed well at simplifying the raw training data and that EIS-AS could cope with big data sets in an appropriate manner.
computer science, artificial intelligence
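
The conflict idea described in the abstract can be illustrated with a small sketch. The abstract does not give the exact mass functions or the objective function used by EIS, so the code below is only an assumption: it builds one simple mass function per neighbor in the style of the classical evidential K-NN rule (support `alpha * exp(-gamma * d^2)` on the neighbor's class, the remainder on the whole frame), fuses them with the unnormalized conjunctive rule, and reports the mass that falls on the empty set as the conflict. The function name `neighbor_conflict` and the parameters `alpha` and `gamma` are hypothetical and not taken from the paper.

```python
import numpy as np


def neighbor_conflict(X, y, k=5, alpha=0.95, gamma=1.0):
    """Conflict among the evidence supplied by the k nearest neighbors.

    Assumed construction (not confirmed by the abstract): each neighbor
    with label c at squared distance d2 supports {c} with mass
    alpha * exp(-gamma * d2) and leaves the rest on the whole frame.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n = X.shape[0]

    # Pairwise squared Euclidean distances; an instance is not its own neighbor.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]

    conflict = np.empty(n)
    for i in range(n):
        support = alpha * np.exp(-gamma * d2[i, nn[i]])
        labels = y[nn[i]]
        # Mass left on the whole frame after fusing the neighbors of each class.
        ignorance = np.array(
            [np.prod(1.0 - support[labels == c]) for c in classes]
        )
        # Mass on singleton {c}: class c commits, every other class stays ignorant.
        singleton = np.array(
            [(1.0 - ignorance[j]) * np.prod(np.delete(ignorance, j))
             for j in range(len(classes))]
        )
        # Whatever is left of the unit mass landed on the empty set.
        conflict[i] = 1.0 - singleton.sum() - np.prod(ignorance)
    return conflict


# Toy 1-D data: two classes that meet between 0.9 and 1.1.
X = np.array([[0.0], [0.1], [0.2], [0.9], [1.1], [1.8], [1.9], [2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
c = neighbor_conflict(X, y, k=2)
boundary_candidates = np.argsort(-c)[:2]  # highest-conflict instances first
```

In this toy example the instances at 0.9 and 1.1 have neighbors from both classes and therefore receive the highest conflict, while instances deep inside a class receive a conflict of essentially zero. Ranking by conflict is only a stand-in for the selection step; the optimization over reduction rate and accuracy mentioned in the abstract is not reproduced here.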