BOD: An Efficient Algorithm for Distributed Outlier Detection
Xi-Te WANG, De-Rong SHEN, Mei BAI, Tie-Zheng NIE, Yue KOU, Ge YU
DOI: https://doi.org/10.11897/SP.J.1016.2016.00036
2016-01-01
Abstract: Outlier detection is an active research topic in data management; its goal is to find the objects that differ markedly from the rest of the data. Outlier detection techniques can be applied in many fields, such as credit card fraud detection, network intrusion detection, and environmental monitoring. Many researchers have worked on effective outlier detection techniques, and a number of excellent approaches have been proposed. Unfortunately, most existing methods assume a centralized processing environment. As data volumes grow, the processing efficiency of these traditional centralized methods becomes quite limited and cannot meet users' increasing requirements. To solve this problem, this paper proposes a novel distributed outlier detection algorithm for large-scale data sets. First, in the data storage stage (i.e., preprocessing), the Balance Driven Spatial Partitioning algorithm (BDSP) is proposed to segment the whole data set into a number of small subsets and allocate these subsets to the corresponding computing nodes. The BDSP algorithm can effectively balance the workload of the computing nodes and achieve a good filtering effect. Furthermore, a new encoding method is designed for the blocks generated by BDSP; it can efficiently determine the adjacency relationships between blocks and reduce the network overhead. Based on these methods, the BDSP-based Outlier Detection algorithm (BOD) is proposed to compute outliers in distributed environments. It consists of two steps. In the first step, on each computing node, BOD performs batch filtering using an R-tree index to rapidly compute the local outliers and obtain a local candidate set. The candidate set consists of the potential outliers that need to be further checked through network communication. In the second step, using the encoding method, BOD determines the computing nodes that need to communicate with each other and outputs the final result from the candidate set with a small amount of network overhead. Finally, experiments on a real data set and a series of synthetic data sets verify the efficiency and effectiveness of BDSP and BOD. The experimental results show that, compared with previous approaches, the proposed algorithms significantly improve the computational efficiency of outlier detection in a distributed environment and drastically reduce the network communication cost of distributed processing.
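
The two-step flow described in the abstract (local filtering into a candidate set, then a check against adjacent blocks only) can be illustrated with a minimal single-process sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: an outlier is taken to be a point with fewer than k neighbors within distance r (a common distance-based definition), a uniform grid stands in for BDSP (so no load balancing is attempted), a brute-force scan stands in for the R-tree index, and plain dictionary lookups stand in for network communication between nodes. All function and parameter names (partition, local_step, global_step, detect, r, k, cell) are hypothetical.

```python
from collections import defaultdict
from math import dist, floor


def partition(points, cell):
    """Map each point index to a square grid block (assumed stand-in for BDSP)."""
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(points):
        blocks[(floor(x / cell), floor(y / cell))].append(i)
    return blocks


def local_step(points, blocks, r, k):
    """Step 1: inside each block, points with fewer than k in-block
    neighbors within distance r become candidates; the rest are safe."""
    candidates = {}
    for key, idxs in blocks.items():
        for i in idxs:
            n = sum(1 for j in idxs
                    if j != i and dist(points[i], points[j]) <= r)
            if n < k:
                candidates[i] = (key, n)
    return candidates


def global_step(points, blocks, candidates, r, k):
    """Step 2: check each candidate against the 8 adjacent blocks only."""
    outliers = []
    for i, ((bx, by), n) in candidates.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if (dx, dy) == (0, 0):
                    continue  # own block was already counted in step 1
                for j in blocks.get((bx + dx, by + dy), ()):
                    if dist(points[i], points[j]) <= r:
                        n += 1
        if n < k:
            outliers.append(i)
    return sorted(outliers)


def detect(points, r, k, cell=None):
    """Return indices of points with fewer than k neighbors within r."""
    cell = r if cell is None else cell
    if cell < r:
        # With cell >= r, every neighbor within r lies in the same
        # block or in one of the 8 adjacent blocks.
        raise ValueError("cell must be at least r")
    blocks = partition(points, cell)
    candidates = local_step(points, blocks, r, k)
    return global_step(points, blocks, candidates, r, k)
```

In this sketch only the candidates from step 1 are ever compared with points in other blocks, which mirrors the idea of limiting cross-node traffic to a small candidate set. Choosing the block width to be at least r is what restricts the second step to adjacent blocks.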