Exploring a High-quality Outlying Feature Value Set for Noise-Resilient Outlier Detection in Categorical Data.

Hongzuo Xu, Yongjun Wang, Li Cheng, Yijie Wang, Xingkong Ma
DOI: https://doi.org/10.1145/3269206.3271721
2018-01-01
Abstract: Unavoidable noise in real-world categorical data presents significant challenges to existing outlier detection methods because they normally fail to separate noisy values from outlying values. Feature subspace-based methods inevitably include noisy values when they retain an entire feature, because a single feature may contain both outlying values and noisy values. Pattern-based methods are normally based on frequency and are easily misled by noisy values, resulting in many faulty patterns. This paper introduces a novel unsupervised framework termed OUVAS, and its parameter-free instantiation RHAC, to explore a high-quality outlying value set for detecting outliers in noisy categorical data. Based on the observation that the relations between values reflect their essence, OUVAS investigates value similarities to cluster values into different groups and combines cluster-level analysis and value-level refinement to identify an outlying value set. RHAC instantiates OUVAS with three successive modules: the combination of the Ochiai coefficient and the LOUVAIN algorithm to cluster values, hierarchical value coupling learning to perform cluster-level analysis, and a threshold to separate fake from real outlying values in value-level refinement. We show that (i) the RHAC-based outlier detector significantly outperforms five state-of-the-art outlier detection methods; and (ii) the extended RHAC-based feature selection method successfully improves the performance of existing outlier detectors and performs better than the two latest outlying feature selection methods.
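The abstract names the Ochiai coefficient as the value-similarity measure used in RHAC's clustering step. The following is a minimal sketch of how that coefficient can be computed between categorical values from their co-occurrence in data objects. It is an illustration only, not the authors' implementation: the function name `ochiai_similarities`, the representation of a value as a `(feature index, value)` pair, and the toy data are all assumptions, and the paper's exact similarity definition may differ.

```python
from itertools import combinations
from math import sqrt


def ochiai_similarities(rows):
    """Return {(u, v): similarity} for every pair of (feature, value) items.

    Hypothetical helper: Ochiai(u, v) = |S_u & S_v| / sqrt(|S_u| * |S_v|),
    where S_u is the set of rows that contain value u.
    """
    # Map each (feature index, value) pair to the ids of the rows containing it.
    support = {}
    for row_id, row in enumerate(rows):
        for feature, value in enumerate(row):
            support.setdefault((feature, value), set()).add(row_id)

    sims = {}
    # sorted() gives a deterministic key order; it assumes the values within
    # a feature are mutually comparable (for example, all strings).
    for u, v in combinations(sorted(support), 2):
        both = len(support[u] & support[v])
        sims[(u, v)] = both / sqrt(len(support[u]) * len(support[v]))
    return sims


# Toy categorical data set (assumed): two features, four objects.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "large"),
]
sims = ochiai_similarities(rows)
```

The resulting pairwise similarities could serve as edge weights of a value graph, which a community detection algorithm such as Louvain would then partition into value clusters. Two values of the same feature never co-occur in one object, so their similarity under this sketch is always zero.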