Abstract: Outlier detection is an active research topic in data management. Its goal is to find the objects that differ markedly from the rest of the data. Outlier detection techniques are applied in many fields, such as credit card fraud detection, network intrusion detection, and environmental monitoring. Many researchers have worked on effective detection techniques, and a number of strong approaches have been proposed. However, most existing methods assume a centralized processing environment; as data volume grows, their processing efficiency becomes limited and cannot meet users' increasing requirements. To address this problem, this paper proposes a novel distributed outlier detection algorithm for large-scale data sets. First, in the data storage (i.e., preprocessing) stage, the Balance Driven Spatial Partitioning (BDSP) algorithm segments the whole data set into a number of small subsets and allocates them to the corresponding computing nodes. BDSP effectively balances the workload across computing nodes and achieves a good filtering effect. Furthermore, a new encoding method is designed for the blocks generated by BDSP; it efficiently determines the adjacency relationships between blocks and reduces network overhead. Building on these methods, the BDSP-based Outlier Detection (BOD) algorithm is proposed to compute outliers in distributed environments. It consists of two steps. In the first step, on each computing node, BOD performs batch filtering with an R-tree index to rapidly compute the local outliers and obtain a local candidate set. The candidate set consists of the potential outliers that must be checked further through network communication. In the second step, BOD uses the encoding method to determine which computing nodes need to communicate with each other, and outputs the final result from the candidate set with a small amount of network overhead. Finally, experiments on a real data set and a series of synthetic data sets verify the efficiency and effectiveness of BDSP and BOD. The results show that, compared with previous approaches, the proposed algorithms significantly improve the computational efficiency of outlier detection in a distributed environment and drastically reduce network communication cost.
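The two-step idea (filter locally, then consult only adjacent blocks for the remaining candidates) can be illustrated with a small single-process sketch. It is not the BDSP or BOD algorithm itself. Several things here are assumptions made only for illustration: the distance-based outlier definition (an object with fewer than `k` neighbors within radius `r`), the uniform grid that stands in for BDSP's balanced partitioning, the integer grid coordinates that stand in for the block encoding, and the brute-force scan that stands in for the R-tree. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def partition(points, cell):
    """Group point indices into grid blocks keyed by integer coordinates.

    Assumption: a uniform grid replaces BDSP; each block plays the role
    of the subset stored on one computing node.
    """
    blocks = defaultdict(list)
    for i, p in enumerate(points):
        blocks[tuple(floor(c / cell) for c in p)].append(i)
    return blocks


def count_neighbors(i, indices, points, r):
    """Count points among `indices` within distance r of point i (excluding i)."""
    return sum(1 for j in indices if j != i and dist(points[i], points[j]) <= r)


def detect_outliers(points, r, k, cell):
    """Return sorted indices of points with fewer than k neighbors within r."""
    # With cell >= r, every neighbor of a point lies in its own block
    # or in a directly adjacent block.
    assert cell >= r, "cell size must be at least r"
    blocks = partition(points, cell)

    # Step 1: local filtering. Points with enough neighbors inside their
    # own block are inliers and never cause cross-block traffic.
    candidates = []
    for key, members in blocks.items():
        for i in members:
            n = count_neighbors(i, members, points, r)
            if n < k:
                candidates.append((key, i, n))

    # Step 2: only candidates consult adjacent blocks, found from the key.
    outliers = []
    for key, i, n in candidates:
        for offset in product((-1, 0, 1), repeat=len(key)):
            if not any(offset):
                continue  # skip the point's own block, already counted
            neighbor_key = tuple(a + b for a, b in zip(key, offset))
            if neighbor_key in blocks:
                n += count_neighbors(i, blocks[neighbor_key], points, r)
            if n >= k:
                break  # enough neighbors found; stop querying other blocks
        if n < k:
            outliers.append(i)
    return sorted(outliers)
```

In this sketch, a point whose own block already contains `k` neighbors is settled in step 1, so only the candidates lead to lookups in other blocks. That is the role the abstract assigns to the local candidate set.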
