Homophily Outlier Detection in Non-IID Categorical Data

Guansong Pang, Longbing Cao, Ling Chen
DOI: https://doi.org/10.48550/arXiv.2103.11516
2021-03-22
Abstract: Most of existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications where the outlierness of different entities is dependent on each other and/or taken from different probability distributions (non-IID). This may lead to the failure of detecting important outliers that are too subtle to be identified without considering the non-IID nature. The issue is even intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and its two instances to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines and incorporates distribution-sensitive outlier factors and their interdependence into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed here to well capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexities show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection of two different existing detectors.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is that **existing outlier detection methods perform poorly on categorical data that is not independent and identically distributed (non-IID)**. Most existing methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities are IID. In practice, the outlierness of different entities is usually interdependent and drawn from different probability distributions (non-IID). When the IID assumption fails, important outliers can go undetected, especially in high-dimensional data with many noisy features.

### Core problems of the paper

1. **Non-IID characteristics**: Real-world data is usually non-IID: the outlierness of different entities is interdependent and may come from different probability distributions.
2. **Limitations of existing methods**: Most existing outlier detection methods rely on the IID assumption and ignore the coupling relationships and heterogeneity between entities, which leads to poor performance on complex data.
3. **Challenges**: With high-dimensional data and a large number of noisy features, existing methods struggle to identify outliers accurately.

### Solutions

To address these problems, the paper proposes a new framework, **Coupled Unsupervised Outlier Detection (CUOT)**, and two instances of it: **Coupled Biased Random Walks (CBRW)** and **multiple-granularity Subgraph Densities-augmented Random Walks (SDRW)**. These methods improve outlier detection in the following ways:

1. **Introducing non-IID outlier factors**: The CUOT framework accounts for both the coupling relationships between values and their heterogeneous distributions, and so captures non-IID characteristics more accurately.
2. **Graph representation and mining**: By constructing a value-value graph, CUOT models how outlierness propagates between values, which improves detection accuracy.
3. **Fine-grained outlierness scoring**: Because CUOT scores individual feature values, it supports both direct outlier detection and outlying feature selection.

### Main contributions

1. **New task definition**: Proposes a new outlier detection task: detecting outliers in non-IID multi-dimensional data.
2. **New framework**: Proposes the CUOT framework, which estimates the outlierness of each value by modeling homogeneous coupling and heterogeneous distribution.
3. **Instance methods**: Implements two methods, CBRW and SDRW, which model outlierness propagation on directed attribute-value graphs and undirected value graphs, respectively.
4. **Theoretical and experimental evidence**: Shows theoretically and experimentally that these methods handle non-IID outlier behaviors and also cope with data that contains many noisy features or has low outlier separability.
5. **Dataset complexity quantification**: Proposes data complexity indicators at four levels and provides corresponding datasets to support outlier detection research on complex data.

With these improvements, CUOT and its instance methods significantly improve outlier detection performance on a variety of complex datasets.
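The idea of propagating outlierness over a value-value graph can be illustrated with a minimal Python sketch. It is not the paper's CBRW or SDRW definition: the function names `value_outlierness` and `score_rows`, the rarity-based factor `delta`, the use of raw co-occurrence counts as edge weights, the damping factor of 0.95, and the unweighted sum used to score rows are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def value_outlierness(rows, damping=0.95, iters=200):
    """Score each (feature_index, value) node with a biased random walk.

    Illustrative assumptions (not taken from the paper):
    - delta(v) grows as v becomes rarer than the mode of its feature;
    - the edge u -> v is weighted by co-occurrence count times delta(v).
    """
    n = len(rows)
    freq = Counter((j, v) for row in rows for j, v in enumerate(row))
    nodes = list(freq)

    # Frequency of the most common value (the mode) in each feature.
    mode = defaultdict(int)
    for (j, _), c in freq.items():
        mode[j] = max(mode[j], c)

    # Assumed intra-feature factor: deviation from the mode plus a base term.
    delta = {
        u: 0.5 * ((mode[u[0]] - c) / mode[u[0]] + (1 - mode[u[0]] / n))
        for u, c in freq.items()
    }

    # Biased edge weights: each co-occurrence of u and v adds delta(v) to u -> v.
    weights = defaultdict(Counter)
    for row in rows:
        for u, v in permutations(list(enumerate(row)), 2):
            weights[u][v] += delta[v]

    # Damped power iteration over the row-normalised biased transitions.
    size = len(nodes)
    pi = {u: 1.0 / size for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) / size for u in nodes}
        for u in nodes:
            total = sum(weights[u].values())
            if total == 0:
                # No usable out-edge: spread this node's mass uniformly.
                for v in nodes:
                    nxt[v] += damping * pi[u] / size
                continue
            for v, w in weights[u].items():
                nxt[v] += damping * pi[u] * w / total
        pi = nxt
    return pi


def score_rows(rows, pi):
    """Score a row as the unweighted sum of its value scores (an assumption)."""
    return [sum(pi[(j, v)] for j, v in enumerate(row)) for row in rows]


# Toy data: the last row combines the rare value "b" with the less common "y".
rows = [("a", "x")] * 7 + [("a", "y")] * 2 + [("b", "y")]
pi = value_outlierness(rows)
scores = score_rows(rows, pi)
ranking = sorted(range(len(rows)), key=lambda i: -scores[i])
```

On this toy input the walk concentrates on the rare values `"b"` and `"y"`, so the last row is ranked first. Scoring values rather than rows is what makes the same scores reusable for feature selection: a feature whose values all receive low scores contributes little to separating outliers.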