Unsupervised Parameter-free Outlier Detection using HDBSCAN* Outlier Profiles

Kushankur Ghosh, Murilo Coelho Naldi, Jörg Sander, Euijin Choo
2024-11-14
Abstract: In machine learning and data mining, outliers are data points that significantly differ from the rest of the dataset and often introduce irrelevant information that can induce bias in its statistics and models. Therefore, unsupervised methods are crucial to detect outliers if there is limited or no information about them. Global-Local Outlier Scores based on Hierarchies (GLOSH) is an unsupervised outlier detection method within HDBSCAN*, a state-of-the-art hierarchical clustering method. GLOSH estimates an outlier score for each data point by comparing its density to the highest density of the region it resides in within the HDBSCAN* hierarchy. GLOSH may be sensitive to HDBSCAN*'s minpts parameter, which influences density estimation. With limited knowledge about the data, choosing an appropriate minpts value beforehand is challenging, as one or some minpts values may better represent the underlying cluster structure than others. Additionally, in the process of searching for "potential outliers", one has to define the number of outliers n a dataset has, which may be impractical and is often unknown. In this paper, we propose an unsupervised strategy to find the "best" minpts value, leveraging the range of GLOSH scores across minpts values to identify the value for which GLOSH scores can best distinguish outliers from the rest of the dataset. Moreover, we propose an unsupervised strategy to estimate a threshold for classifying points into inliers and (potential) outliers without the need to pre-define any value. Our experiments show that our strategies can automatically find the minpts value and threshold that yield the best or near-best outlier detection results using GLOSH.
Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses two key problems in unsupervised outlier detection:

1. **Selecting an appropriate value for the `minpts` parameter**: HDBSCAN* is a hierarchical clustering method whose results depend on the parameter `minpts`. `minpts` influences density estimation and, in turn, the GLOSH (Global-Local Outlier Scores based on Hierarchies) scores. In practice the underlying distribution of the data is usually unknown, so choosing an appropriate `minpts` value beforehand is challenging. Different `minpts` values may produce different cluster structures and therefore different outlier detection results.
2. **Automatically determining the threshold between outliers and inliers**: Labelling points as potential outliers usually requires a pre-defined threshold or a pre-defined number of outliers. In practice the number of outliers is often unknown, which makes such a value hard to set in advance.

### Overview of solutions

To address these problems, the authors propose the following strategies:

1. **GLOSH-Profile**: For each data point, the sequence of its GLOSH scores across a range of `minpts` values (its GLOSH-Profile) is constructed, and the way these scores change is studied. The authors find that once `minpts` reaches a certain value, the rate of change of the GLOSH scores becomes consistent, which indicates that this `minpts` value reflects the underlying cluster structure of the data well.
2. **Auto-GLOSH**: Building on the GLOSH-Profiles, this method selects the best `minpts` value automatically. It measures how much the GLOSH scores differ between adjacent `minpts` values (using the Pearson correlation) and picks the `minpts` value at which the rate of change starts to become consistent.
3. **POLAR (Potential Outlier Labelling AppRoach)**: Using the distribution of GLOSH scores at the `minpts` value selected by Auto-GLOSH, this method determines the classification threshold automatically. It does not require the number of outliers to be defined in advance; the threshold is derived from the distribution of the GLOSH scores.

### Experimental results

The experiments show that the proposed strategies automatically find a `minpts` value and a classification threshold that yield the best or near-best outlier detection results with GLOSH, without manual tuning. According to the reported results, Auto-GLOSH and POLAR are robust and accurate across different types of outliers (global, local, and clustered).

### Summary

The paper tackles parameter selection and threshold determination in unsupervised outlier detection by introducing GLOSH-Profiles, Auto-GLOSH, and POLAR, which improve the practical use of HDBSCAN* and GLOSH for outlier detection.
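
### Illustrative sketch

The idea of building per-point score profiles and comparing adjacent `minpts` values with the Pearson correlation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the GLOSH scores are assumed to be precomputed for each `minpts` value (for example, from an HDBSCAN* implementation that exposes GLOSH scores), the toy numbers are made up, and the stopping rule in `pick_minpts` (with its `tol` parameter) is a hypothetical stand-in for the paper's actual selection criterion. All function names are illustrative.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for constant scores
    return cov / sqrt(var_a * var_b)


def glosh_profile(scores_by_minpts, point_index):
    """Scores of one point across increasing minpts values (its profile)."""
    return [scores_by_minpts[k][point_index] for k in sorted(scores_by_minpts)]


def adjacent_correlations(scores_by_minpts):
    """Map each minpts value to the correlation with the previous minpts value."""
    ks = sorted(scores_by_minpts)
    return {
        k2: pearson(scores_by_minpts[k1], scores_by_minpts[k2])
        for k1, k2 in zip(ks, ks[1:])
    }


def pick_minpts(scores_by_minpts, tol=0.01):
    """Hypothetical rule: the smallest minpts from which all adjacent
    correlations stay within tol of 1, i.e. the scores have stabilized."""
    corr = adjacent_correlations(scores_by_minpts)
    ks = sorted(corr)
    for i, k in enumerate(ks):
        if all(corr[j] >= 1.0 - tol for j in ks[i:]):
            return k
    return ks[-1]  # never stabilized: fall back to the largest value


# Made-up scores for five points; keys are minpts values.
profiles = {
    2: [0.9, 0.1, 0.5, 0.2, 0.3],
    3: [0.2, 0.1, 0.6, 0.8, 0.3],
    4: [0.1, 0.1, 0.2, 0.9, 0.2],
    5: [0.1, 0.1, 0.2, 0.9, 0.2],
    6: [0.1, 0.1, 0.2, 0.9, 0.2],
}

best = pick_minpts(profiles)
print("profile of point 0:", glosh_profile(profiles, 0))
print("selected minpts:", best)
```

In this toy example the scores stop changing from `minpts = 4` onward, so the first adjacent pair with a correlation of about 1 is (4, 5) and the sketch selects 5. A threshold on the scores at the selected `minpts` value would then separate inliers from potential outliers; how that threshold is derived is specific to the paper and is not reproduced here.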