Parallel multi-label K-nearest neighbor algorithm based on Spark

Jin WANG, Cui-ping XIA, Wei-hua OUYANG, Hong WANG, Xin DENG, Qiao-song CHEN
DOI: https://doi.org/10.3969/j.issn.1007-130X.2017.02.002
2017-01-01
Abstract: With the advent of the big data era, large-scale multi-label data mining applications have attracted extensive attention. Multi-Label K-Nearest Neighbor (ML-KNN) is a simple, efficient, and widely used method that outperforms other traditional multi-label learning algorithms in many real-world applications. However, as the volume of data to be processed grows, the ML-KNN algorithm can no longer meet time and memory requirements. Exploiting Spark's parallelism and in-memory iterative computation, we propose SML-KNN, an algorithm built on the Spark distributed in-memory computing platform. First, in the map stage, we find the K nearest neighbors of each test sample within each partition. Then, in the reduce stage, we determine the final K nearest neighbors from the per-partition K nearest neighbors. Finally, we cluster the label sets of the K nearest neighbors in parallel and output the target label sets using the maximum a posteriori (MAP) principle. Experiments in stand-alone and cluster environments show that, while preserving classification accuracy, the performance of SML-KNN scales approximately linearly with computing resources, and that the proposed algorithm improves the ability of ML-KNN to process large-scale multi-label data.
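The three stages described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: plain Python lists stand in for Spark partitions, the training data is assumed to be the partitioned set, and the final stage is read as counting how many of the K neighbors carry each label before applying the MAP rule. The function names (`local_knn`, `merge_knn`, `map_decision`) and the prior and likelihood tables are hypothetical and chosen only for illustration.

```python
import heapq
import math
from collections import Counter
from functools import reduce


def local_knn(partition, query, k):
    """Map stage (sketch): k nearest training samples within one partition."""
    scored = [(math.dist(x, query), labels) for x, labels in partition]
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


def merge_knn(a, b, k):
    """Reduce stage (sketch): keep the k nearest of two candidate lists."""
    return heapq.nsmallest(k, a + b, key=lambda pair: pair[0])


def map_decision(count, prior, like_pos, like_neg):
    """MAP rule for one label: predict it when
    P(label) * P(count | label) > P(no label) * P(count | no label)."""
    return prior * like_pos[count] > (1 - prior) * like_neg[count]


# Toy training data, split into two stand-in "partitions" (assumed layout).
partitions = [
    [((0.0, 0.0), {"a"}), ((5.0, 5.0), {"b"})],
    [((1.0, 0.0), {"a", "b"}), ((0.0, 2.0), {"a"}), ((9.0, 9.0), {"b"})],
]
query, k = (0.0, 0.0), 3

# Map: per-partition candidates. Reduce: merge them into the global top k.
candidates = [local_knn(p, query, k) for p in partitions]
neighbors = reduce(lambda a, b: merge_knn(a, b, k), candidates)

# Count how many of the k neighbors carry each label.
counts = Counter(label for _, labels in neighbors for label in labels)

# Hypothetical probability tables, indexed by neighbor count 0..k.
# In ML-KNN these are estimated from the training set; here they are made up.
prior = 0.5
like_pos = [0.1, 0.2, 0.3, 0.4]
like_neg = [0.4, 0.3, 0.2, 0.1]

predicted = {
    label
    for label in ("a", "b")
    if map_decision(counts[label], prior, like_pos, like_neg)
}
```

Because each partition returns at most K candidates, the merge step handles only a small, bounded amount of data per test sample, which is what makes the neighbor search straightforward to distribute.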