A Fast kNN-Based MST Outlier Detection Method
Li ZHU, Yuan-Yuan QIU, Shuai YU, Sheng YUAN
DOI: https://doi.org/10.11897/SP.J.1016.2017.02856
2017-01-01
Chinese Journal of Computers
Abstract: Outlier detection, also known as anomaly detection, is a fundamental research task in data mining. It is mainly used to uncover unusual mechanisms or potential dangers, and aims to detect the few suspicious observations that deviate markedly from the rest of the data. Outliers, which are novel, abnormal, and rare, are often discarded as noise or abnormal data. Outliers are also classified into many types, such as local and partial outliers. Outlier detection techniques can be applied in many fields, such as intrusion detection, fraud detection, and the detection of early signs of disease in medicine. Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. The kNN-based algorithm can be used on large data sets, so it is widely applied in distance-based and density-based outlier detection. Unfortunately, the time complexity of the kNN-based algorithm is O(N^2), so its cost grows rapidly with the size of the data set. The time and space complexity of minimum spanning tree (MST)-based clustering algorithms using Prim's or Kruskal's method is also O(N^2), and the clustering result depends on user-supplied parameters. Moreover, such algorithms cannot detect outliers in high-density clusters. Existing MST-based algorithms become ineffective when given unsuitable parameters or applied to data sets composed of clusters with diverse shapes, sizes, and densities. In addition, most MST algorithms cannot build the tree dynamically, because they need to know the distance between every pair of points in advance. To address these problems, we propose a new outlier detection method that combines the advantages of distance-based and density-based methods. First, the algorithm builds a split tree to store the information about the data points. Second, we efficiently obtain all the pair sets of the well-separated pair decomposition of the whole data set. Third, the algorithm partitions the input data set into several frames that satisfy a certain condition, so that each point's k nearest neighbors can be obtained quickly from the results of the first two steps. Fourth, a minimum spanning tree is built dynamically from the result of the third step. In addition, we rank the points suspected of being outliers by their outlier factors, using MST-based clustering without requiring the number of clusters to be entered manually. A new algorithm and a new metric are proposed to select the exact number of clusters and avoid insignificant clusters. Finally, we detect all outliers. The time complexities of computing the kNN and of building the tree are O(kN) and O(N log N), respectively. The experiments show that the new algorithm can detect both local and global outliers without requiring the user to supply the number of clusters. In the experiments, we use a series of real and synthetic data sets to verify the efficiency and effectiveness of KDNS, FkNN, and ADC proposed in this paper. The experimental results show that, compared with previous approaches, the proposed algorithms drastically reduce time complexity and significantly improve the outlier detection rate.
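As a rough illustration of the two ingredients the abstract combines, the following Python sketch scores points by their mean distance to their k nearest neighbors and then runs Kruskal's method over the kNN edges only. It is not the paper's algorithm: the kNN search here is the brute-force O(N^2) baseline rather than the split-tree and well-separated pair decomposition approach, the mean-distance score is a common stand-in rather than the paper's outlier factor, and all function names are hypothetical.

```python
import math


def knn_brute_force(points, k):
    """For each point, return its k nearest neighbors as (distance, index) pairs.

    Assumption: brute-force search, O(N^2) distance evaluations. The paper
    replaces this step with a faster split-tree/WSPD-based search.
    """
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append(dists[:k])
    return neighbors


def knn_outlier_scores(neighbors):
    """Hypothetical score: mean distance to the k nearest neighbors."""
    return [sum(d for d, _ in nbrs) / len(nbrs) for nbrs in neighbors]


def kruskal_on_knn_graph(n, neighbors):
    """Run Kruskal's method on the kNN edges only.

    Returns the edges (weight, u, v) of a spanning forest. The forest is a
    single tree only if the kNN graph is connected.
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Deduplicate undirected edges, then sort them by weight.
    edges = sorted(
        {(d, min(i, j), max(i, j))
         for i, nbrs in enumerate(neighbors)
         for d, j in nbrs}
    )
    forest = []
    for d, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            forest.append((d, u, v))
    return forest


# Four points on a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
neighbors = knn_brute_force(points, k=2)
scores = knn_outlier_scores(neighbors)
forest = kruskal_on_knn_graph(len(points), neighbors)
```

In this toy data set the distant point receives the largest score, and the longest forest edge is the one attaching it to the square. Cutting unusually long tree edges is the usual intuition behind MST-based clustering and outlier detection.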