A DP Canopy K-Means Algorithm for Privacy Preservation of Hadoop Platform.

Tao Shang, Zheng Zhao, Zhenyu Guan, Jianwei Liu
DOI: https://doi.org/10.1007/978-3-319-69471-9_14
2017-01-01
Abstract: Combining the K-means algorithm for data mining with differential privacy (DP) preservation improves the security of data, but the number of clusters and the initial center points are still selected blindly and at random. In this paper, we integrate an optimized Canopy algorithm with the DP K-means algorithm and apply it to the Hadoop platform. First, we optimize the Canopy algorithm according to the minimum and maximum principle and implement it using the functions of the MapReduce framework. Second, we use the resulting number and set of center points to implement the DP K-means algorithm on MapReduce. As a result, the improved Canopy algorithm optimizes the selection of the number of clusters and of the center points on the Hadoop platform, so the proposed K-means algorithm improves security, usability, and computational efficiency.
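The two stages described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract gives no algorithmic details, so everything below is an assumption. In particular, the "minimum and maximum principle" is read here as farthest-point selection (each new center is the point whose distance to its nearest existing center is largest), the threshold `t2` and the function names `max_min_centers`, `laplace`, and `dp_kmeans_step` are hypothetical, the data are assumed to be normalized to [0, 1] in every dimension, and the Laplace scale `(dim + 1) / epsilon` is the standard choice for noisy per-cluster sums and counts under that normalization. The MapReduce distribution is only indicated in comments.

```python
import math
import random


def max_min_centers(points, t2):
    """Assumed max-min selection: repeatedly add the point that is farthest
    from its nearest chosen center, until that distance is at most t2."""
    centers = [points[0]]
    while True:
        # Distance from every point to its nearest already-chosen center.
        nearest = [min(math.dist(p, c) for c in centers) for p in points]
        idx = max(range(len(points)), key=nearest.__getitem__)
        if nearest[idx] <= t2:
            return centers
        centers.append(points[idx])


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_kmeans_step(points, centers, epsilon, rng):
    """One K-means iteration with Laplace noise on cluster sums and counts.

    Assumes every coordinate lies in [0, 1], so adding or removing one point
    changes the sums and the count by at most dim + 1 in total (L1 norm).
    epsilon is the privacy budget spent on this single iteration.
    """
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)

    # "Map" phase: assign each point to its nearest center.
    for p in points:
        j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]

    # "Reduce" phase: perturb the aggregates, then recompute each center.
    scale = (dim + 1) / epsilon
    new_centers = []
    for j, old in enumerate(centers):
        noisy_count = counts[j] + laplace(scale, rng)
        if noisy_count < 1.0:
            # Too little (noisy) mass to form a mean; keep the old center.
            new_centers.append(old)
            continue
        new_centers.append(
            tuple((sums[j][d] + laplace(scale, rng)) / noisy_count
                  for d in range(dim))
        )
    return new_centers
```

In this sketch the first stage fixes both the number of clusters and the initial centers, so the second stage no longer depends on a random initialization. Repeating `dp_kmeans_step` for several iterations would require splitting the total privacy budget across them.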