Abstract: Outlier detection is a data mining technique that aims to detect unusual or unexpected records in a dataset. Existing outlier detection algorithms have different pros and cons and exhibit different sensitivity to noisy data such as extreme values. In this paper, we propose a novel cluster-based outlier detection algorithm named MSD-Kmeans that combines the statistical method of Mean and Standard Deviation (MSD) with the machine learning clustering algorithm K-means to detect outliers more accurately and with better control of extreme values. The combined MSD-Kmeans method has two phases: (1) applying the MSD algorithm to eliminate as many noisy data points as possible, minimizing their interference with the clusters, and (2) applying the K-means algorithm to obtain locally optimal clusters. We evaluate our algorithm and demonstrate its effectiveness in the context of detecting possible overcharging of taxi fares, as greedy, dishonest drivers may attempt to charge high fares by detouring. We compare the performance indicators of MSD-Kmeans with those of other outlier detection algorithms, such as MSD, K-means, Z-score, MIQR and LOF, and show that the proposed MSD-Kmeans algorithm achieves the highest precision, accuracy, and F-measure. We conclude that MSD-Kmeans can be used for effective and efficient outlier detection on data of varying quality on IoT devices.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the scalability of existing outlier detection algorithms on large and complex data sets, together with their sensitivity to noisy data (such as extreme values). Specifically:

1. **Limitations of existing algorithms**: Existing outlier detection algorithms have their own advantages and disadvantages and show different sensitivities to noisy data (for example, extreme values), which may degrade the quality and accuracy of clustering.
2. **Requirements of specific application scenarios**: In the application scenario of taxi fare fraud detection in particular, a more effective method is needed to identify abnormal driving routes and fares, in order to catch greedy drivers who increase fares by taking detours.

To solve these problems, the authors propose a new clustering-based outlier detection algorithm, MSD-Kmeans. This algorithm combines the mean and standard deviation (MSD) method from statistics with the K-means clustering algorithm from machine learning, aiming to detect global and local outliers more accurately while better controlling the influence of extreme values.

### Main contributions

1. **Propose a new algorithm**: Propose the MSD-Kmeans algorithm, which combines the advantages of MSD and K-means.
2. **Application and verification**: Apply MSD-Kmeans to the New York City yellow taxi data set to verify its effectiveness in identifying possible taxi fare fraud.
3. **Performance comparison**: Compare its performance with other outlier detection algorithms (such as MSD, K-means, Z-score, MIQR and LOF), and show that MSD-Kmeans achieves the highest precision, accuracy and F-measure.

### Algorithm process

MSD-Kmeans is divided into two stages:

1. **First stage (MSD)**: Use the mean and standard deviation (formulas below) to eliminate as many global outliers (extreme values) as possible, reducing their interference with the subsequent clustering.
   - Calculate the mean \(\mu\): \[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \]
   - Calculate the standard deviation \(\sigma\): \[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2} \]
   - Define normal values \(N\) and global outliers \(S\): \[ \mu - \sigma < N < \mu + \sigma \] \[ S > \mu + \sigma \quad \text{or} \quad S < \mu - \sigma \]
2. **Second stage (K-means)**: Run K-means clustering on the remaining normal data to obtain locally optimal clusters and further detect local outliers.

Through this two-stage method, MSD-Kmeans can detect outliers more effectively while reducing the interference of noisy data.
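
The two stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on one-dimensional values only, the function names (`msd_filter`, `kmeans_1d`, `local_outliers`) and the parameters `k`, `iters`, `seed` and `t` are hypothetical, and the fares in the demo are made-up numbers rather than the New York City data. The summary does not say how local outliers are identified after clustering, so the rule in `local_outliers` (distance from the centroid greater than `t` cluster standard deviations) is an assumption chosen for illustration.

```python
import random
import statistics


def msd_filter(values, k=1.0):
    """Stage 1: split values into normal points and global outliers.

    A value is normal when mu - k*sigma < x < mu + k*sigma, with sigma the
    population standard deviation (1/n). k=1 matches the formulas above;
    k itself is a hypothetical knob. Values exactly on a boundary are
    treated as outliers here (an assumption).
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    lo, hi = mu - k * sigma, mu + k * sigma
    normal = [x for x in values if lo < x < hi]
    outliers = [x for x in values if not (lo < x < hi)]
    return normal, outliers


def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))


def kmeans_1d(values, k, iters=100, seed=0):
    """Stage 2: plain Lloyd-style K-means on 1-D data.

    Returns (centroids, labels). Like any K-means run, the result is only
    a local optimum and depends on the random initial centroids.
    """
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    for _ in range(iters):
        labels = [nearest(x, centroids) for x in values]
        updated = []
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(statistics.fmean(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(x, centroids) for x in values]
    return centroids, labels


def local_outliers(values, centroids, labels, t=2.0):
    """Assumed rule: flag points farther than t cluster-sigmas from their centroid."""
    flagged = []
    for j, c in enumerate(centroids):
        members = [x for x, lab in zip(values, labels) if lab == j]
        if len(members) < 2:
            continue
        spread = statistics.pstdev(members)
        flagged += [x for x in members if abs(x - c) > t * spread]
    return flagged


if __name__ == "__main__":
    # Made-up fares: two typical fare levels plus one extreme value.
    fares = [10, 11, 12, 9, 10, 11, 30, 31, 29, 30, 32, 100]
    normal, global_out = msd_filter(fares)
    centroids, labels = kmeans_1d(normal, k=2)
    print("global outliers:", global_out)
    print("centroids:", sorted(centroids))
    print("local outliers:", local_outliers(normal, centroids, labels))
```

Removing the extreme fare first matters because K-means centroids are means: a single very large value can pull a centroid toward it or claim a cluster of its own, which is the interference the first stage is meant to reduce.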

MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and Local Outliers

MSD-Kmeans: A Hybrid Algorithm for Efficient Detection of Global and Local Outliers

A neighborhood weighted-based method for the detection of outliers

A fast MST-inspired kNN-based outlier detection method

D.MCA: Outlier Detection with Explicit Micro-Cluster Assignments

Effective Outlier Detection for Ensuring Data Quality in Flotation Data Modelling Using Machine Learning (ML) Algorithms

A New Outlier Detection Algorithm Based on Fast Density Peak Clustering Outlier Factor

Application of k-sigma Algorithm Based on Monte Carlo in Outlier Detection

Intelligent Identification and Order-Sensitive Correction Method of Outliers from Multi-Data Source Based on Historical Data Mining

Detecting outliers by clustering algorithms

On Saving Outliers for Better Clustering over Noisy Data

Human-in-the-loop Outlier Detection

Outlier detection method based on high-density iteration

Data Mining Based Outlier Cluster Detection Algorithm

Efficient and Robust KPI Outlier Detection for Large-Scale Datacenters

Detecting Outliers in Data with Correlated Measures

MS2OD: Outlier Detection Using Minimum Spanning Tree and Medoid Selection

A Scalable Algorithm for Detecting Community Outliers in Social Networks

A method for outlier detection based on cluster analysis and visual expert criteria

Outlier Detection with Cluster Catch Digraphs

ADD: a new average divergence difference-based outlier detection method with skewed distribution of data objects