On Saving Outliers for Better Clustering over Noisy Data.
Shaoxu Song,Fei Gao,Ruihong Huang,Yihan Wang
DOI: https://doi.org/10.1145/3448016.3457271
2021-01-01
Abstract:Clustering is often distracted by errors, frequently observed in almost all areas, ranging from online questionnaire to sensor reading in IoT. The dirty data values not only make themselves (the corresponding tuples) outlying, but also mislead the clustering of remaining tuples, e.g., mistakenly splitting a cluster into two or distorting the cluster center. The reason is that the traditional clustering methods either simply ignore the outliers such as DBSCAN or assign them to the closest clusters anyway, e.g., in K-Means. In this paper, we propose to save the outliers for better clustering. The idea is to adjust the erroneous values (often minimally) of the outlier in order to make it appear normally. That is, the tuples after adjusting values are no longer outlying, and thus will be clustered without distracting others. The outlier saving by value adjustment is designed to work with any clustering methods (e.g., DBSCAN or K-Means). Our technical contributions include: (1) showing NPhardness of the outlier saving problem for clustering, (2) deriving lower and upper bounds of the optimal solutions, and (3) devising approximation algorithm with performance guarantees referring to the aforesaid bounds. Experiments on datasets with real-world outliers demonstrate the higher accuracy of our proposal, compared to the state-of-the-art approaches. Remarkably, we show that the adjusted data with outlier saving indeed improve significantly clustering, as well as other applications such as classification and record matching.