Abstract: Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications, where the outlierness of different entities is dependent on each other and/or taken from different probability distributions (non-IID). As a result, important outliers that are too subtle to identify without considering the non-IID nature may go undetected. The issue intensifies in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and two instances of it to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines distribution-sensitive outlier factors and incorporates them, together with their interdependence, into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection by two different existing detectors.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **Existing outlier detection methods perform poorly on non-independent and identically distributed (non-IID) categorical data**. Specifically, most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities are independent and identically distributed (IID). In practical applications, however, the outlierness of different entities is usually interdependent and/or drawn from different probability distributions (non-IID). When the assumption fails, important outliers may go undetected, especially in high-dimensional data with many noisy features.

### Core problems of the paper

1. **Non-IID characteristics**: Real-world data usually has non-IID characteristics, that is, the outlierness of different entities is interdependent and may come from different probability distributions.
2. **Limitations of existing methods**: Most existing outlier detection methods rely on the IID assumption and ignore the coupling relationships and heterogeneity between entities, which leads to poor performance on complex data.
3. **Challenges**: With high-dimensional data and a large number of noisy features, existing methods struggle to identify outliers accurately.

### Solutions

To solve these problems, the paper proposes a new framework, **Coupled Unsupervised Outlier Detection (CUOT)**, and two instance methods: **Coupled Biased Random Walks (CBRW)** and **multiple-granularity Subgraph Densities-augmented Random Walks (SDRW)**. These methods improve outlier detection in the following ways:

1. **Introducing non-IID outlier factors**: The CUOT framework considers the coupling relationships between values and their heterogeneous distributions, thereby capturing non-IID characteristics more accurately.
2. **Graph representation and mining**: By constructing a value-value graph, CUOT models and propagates outlierness across values, thereby improving detection accuracy.
3. **Fine-grained outlierness scoring**: Because CUOT scores individual feature values, it supports both direct outlier detection and outlying feature selection.

### Main contributions

1. **New task definition**: Propose a new outlier detection task, that is, detecting outliers in non-IID multi-dimensional data.
2. **New framework**: Propose the CUOT framework to estimate the outlierness of each value by modeling homogeneous coupling and heterogeneous distribution.
3. **Instance methods**: Implement two methods, CBRW and SDRW, which model outlierness propagation on directed attribute-value graphs and undirected value graphs, respectively.
4. **Theoretical and experimental support**: Show theoretically and experimentally that these methods can handle not only non-IID outlier behaviors but also data with a large number of noisy features or low outlier separability.
5. **Dataset complexity quantification**: Propose four data complexity indicators and provide corresponding datasets to promote outlier detection research on complex data.

Through these improvements, CUOT and its instance methods significantly improve outlier detection performance on a variety of complex datasets.
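
The value-graph idea summarized under Solutions can be illustrated with a small sketch. It is a simplified illustration, not the authors' implementation: the function name `cbrw_like_scores`, the frequency-based intra-feature factor, the conditional-probability edge weights, the damping value, and the unweighted sum used for object scores are all assumptions chosen to keep the example short. Each (feature, value) pair becomes a node, a random walk is biased toward values that are rare within their own feature, and the stationary probability of a node serves as its outlierness.

```python
from collections import Counter


def cbrw_like_scores(rows, damping=0.95, iters=500, eps=1e-9):
    """Hypothetical sketch: biased random walk over a value-value graph.

    rows: list of equal-length tuples of categorical values (at least 2 features).
    Returns (value_scores, object_scores).
    """
    n, d = len(rows), len(rows[0])
    # Nodes are (feature index, value) pairs; counts holds their frequencies.
    counts = Counter((j, row[j]) for row in rows for j in range(d))
    nodes = list(counts)
    m = len(nodes)

    # Intra-feature factor (assumed form): deviation from the feature's mode
    # plus a base term that grows when the mode itself is not dominant.
    mode = [max(c for (f, _), c in counts.items() if f == j) for j in range(d)]
    delta = {}
    for (j, v), c in counts.items():
        delta[(j, v)] = 0.5 * ((mode[j] - c) / mode[j] + (1.0 - mode[j] / n)) + eps

    # Inter-feature coupling: co-occurrence counts of values in the same row.
    co = Counter()
    for row in rows:
        vals = [(j, row[j]) for j in range(d)]
        for u in vals:
            for v in vals:
                if u != v:
                    co[(u, v)] += 1

    # Transition u -> v is proportional to delta[v] * p(u | v), then row-normalized.
    out = {u: {} for u in nodes}
    for (u, v), c in co.items():
        out[u][v] = delta[v] * c / counts[v]
    for u in nodes:
        total = sum(out[u].values())
        out[u] = {v: w / total for v, w in out[u].items()}

    # Power iteration with uniform teleportation (damped random walk).
    pi = {u: 1.0 / m for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) / m for u in nodes}
        for u, edges in out.items():
            for v, w in edges.items():
                nxt[v] += damping * pi[u] * w
        pi = nxt

    # Object score (assumption): plain sum of the scores of its values.
    object_scores = [sum(pi[(j, row[j])] for j in range(d)) for row in rows]
    return pi, object_scores


rows = [("a", "x")] * 7 + [("a", "y"), ("b", "x"), ("b", "y")]
value_scores, object_scores = cbrw_like_scores(rows)
```

In this toy data set the last row combines the two rare values `b` and `y`, so it receives the largest object score, while the seven `("a", "x")` rows receive the smallest. Ranking features by the scores of their values would be one way to approach outlying feature selection, though the sketch does not implement it.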

Homophily Outlier Detection in Non-IID Categorical Data

An Optimization Model for Outlier Detection in Categorical Data

Distributed Outlier Detection in Hierarchically Structured Datasets with Mixed Attributes

Human-in-the-loop Outlier Detection.

Purification, characterization and molecular cloning of tyrosinase from the cephalopod mollusk, Illex argentinus.

Information-based Projection Method for Categorical Clustering and Outlier Detection

Detecting outliers by clustering algorithms

Outlier detection using conditional information entropy and rough set theory

Outlier detection for incomplete real-valued data via information entropy and class-consistent technology

Fairness-aware Outlier Ensemble

Sparse Modeling-Based Sequential Ensemble Learning for Effective Outlier Detection in High-Dimensional Numeric Data.

Outlier detection method based on high-density iteration

Deep Clustering based Fair Outlier Detection

NDOD: an Efficient Neighboring Dependent Outlier Detector for Bias Distributed Large Datasets

Multigranulation Relative Entropy-Based Mixed Attribute Outlier Detection in Neighborhood Systems

A neighborhood weighted-based method for the detection of outliers

A fast MST-inspired kNN-based outlier detection method

Outliers Learning And Its Applications

Privacy-Preserving Outlier Detection with High Efficiency over Distributed Datasets

Outlier detection using flexible categorisation and interrogative agendas

Outlier Detection with Cluster Catch Digraphs