Abstract: Outlier detection is an active research topic in data management; its goal is to find the objects that differ markedly from the rest of the data. Outlier detection techniques are applied in many fields, such as credit card fraud detection, network intrusion detection, and environmental monitoring. Many researchers have worked on effective outlier detection techniques, and a number of approaches have been proposed. However, most existing methods assume a centralized processing environment, and as data volume grows, their processing efficiency becomes limited and cannot meet users’ increasing requirements. To address this problem, this paper proposes a novel distributed outlier detection algorithm for large-scale data sets. First, in the data storage stage (i.e., preprocessing), the Balance Driven Spatial Partitioning (BDSP) algorithm is proposed to segment the whole data set into a number of small subsets and allocate them to the corresponding computing nodes. BDSP effectively balances the workload across computing nodes and achieves a good filtering effect. Furthermore, a new encoding method is designed for the blocks generated by BDSP; it can efficiently determine the adjacency relationships between blocks and reduce network overhead. Building on these methods, the BDSP-based Outlier Detection (BOD) algorithm is proposed to compute outliers in distributed environments. It consists of two steps. In the first step, on each computing node, BOD performs batch filtering using an R-tree index to rapidly compute the local outliers and obtains a local candidate set, which consists of the potential outliers that must be checked further through network communication. In the second step, BOD uses the encoding method to determine which computing nodes need to communicate with each other, and outputs the final result from the candidate set with a small amount of network overhead. Finally, experiments on a real data set and a series of synthetic data sets verify the efficiency and effectiveness of BDSP and BOD. The results show that, compared with previous approaches, the proposed algorithms significantly improve the computational efficiency of outlier detection in a distributed environment and substantially reduce network communication cost.
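The two-step structure described in the abstract (local filtering into a candidate set, then checks against adjacent blocks only) can be illustrated with a small single-process sketch. It is not the paper's implementation and rests on several assumptions the abstract does not state: an outlier is taken to be a point with fewer than k neighbors within distance r; a uniform grid with cell side at least r stands in for BDSP; a brute-force scan stands in for the R-tree index; and adjacency is read from grid coordinates rather than from the paper's block encoding. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def partition(points, cell):
    """Group point indices by grid cell (a simple stand-in for BDSP blocks)."""
    blocks = defaultdict(list)
    for i, p in enumerate(points):
        blocks[tuple(floor(c / cell) for c in p)].append(i)
    return blocks


def local_step(points, blocks, r, k):
    """Step 1: within each block, keep points with fewer than k local neighbors.

    Points that already have k neighbors inside their own block are inliers
    and need no communication; the rest form the candidate set.
    """
    candidates = {}
    for key, ids in blocks.items():
        for i in ids:
            n = sum(1 for j in ids if j != i and dist(points[i], points[j]) <= r)
            if n < k:
                candidates[i] = (key, n)
    return candidates


def adjacent(key):
    """Keys of all cells that touch the given cell (excluding the cell itself)."""
    return [tuple(a + d for a, d in zip(key, delta))
            for delta in product((-1, 0, 1), repeat=len(key)) if any(delta)]


def global_step(points, blocks, candidates, r, k):
    """Step 2: re-check each candidate against adjacent blocks only."""
    outliers = []
    for i, (key, n) in candidates.items():
        for nb in adjacent(key):
            n += sum(1 for j in blocks.get(nb, ()) if dist(points[i], points[j]) <= r)
            if n >= k:
                break  # enough neighbors found; the candidate is an inlier
        if n < k:
            outliers.append(i)
    return sorted(outliers)


def detect_outliers(points, r, k, cell=None):
    """Return indices of points with fewer than k neighbors within distance r."""
    cell = r if cell is None else cell
    if cell < r:
        raise ValueError("cell side must be >= r so neighbors lie in adjacent cells")
    blocks = partition(points, cell)
    return global_step(points, blocks, local_step(points, blocks, r, k), r, k)
```

Because the cell side is at least r, any neighbor of a point lies in its own cell or in a touching cell, so only adjacent blocks need to be consulted in the second step. In a distributed setting each block would reside on a computing node, and the adjacent-block lookups are the part that would incur network communication.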