Abstract: Outlier detection is critically important in data mining. Real-world data are imprecise and ambiguous, which can be handled by means of rough set theory. Information entropy is an effective way to measure the uncertainty in an information system. Most outlier detection methods can be called unsupervised because they deal only with unlabeled data. When sufficient labeled data are available and these methods are applied to a decision information system, the decision attribute is discarded. Thus, these methods may not be suitable for outlier detection in a decision information system. This paper proposes a supervised outlier detection method using conditional information entropy and rough set theory. First, conditional information entropy in a decision information system is calculated based on rough set theory, which provides a more comprehensive measure of uncertainty. Then, the relative entropy and relative cardinality are put forward. Next, the degree of outlierness and a weight function are presented to find outlier factors. Finally, a conditional information entropy-based (CIE) outlier detection algorithm is given. The performance of the algorithm is evaluated and compared with existing outlier detection algorithms: LOF, KNN, Forest, SVM, IE, and ECOD. Twelve data sets taken from UCI are used to assess its efficiency and performance. For example, on the Hayes data set the AUC value of the CIE algorithm is 0.949, whereas the AUC values of the LOF, KNN, SVM, Forest, IE and ECOD algorithms are 0.647, 0.572, 0.680, 0.676, 0.928 and 0.667, respectively. The advantage of the proposed outlier detection method is that it fully utilizes the decision information.
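To make the first step more concrete, the following is a minimal sketch of how a conditional entropy of the decision attribute given a set of condition attributes might be computed from the equivalence classes of a decision table. It assumes the standard Shannon form H(D|B) = -Σ_i P(X_i) Σ_j P(Y_j|X_i) log2 P(Y_j|X_i), which may differ from the exact definition used in the paper; the function names `partition` and `conditional_entropy` and the toy table are hypothetical, and the relative entropy, relative cardinality, and outlier factor are not reproduced here.

```python
from collections import Counter, defaultdict
from math import log2


def partition(rows, attrs):
    """Group row indices into equivalence classes: rows with identical
    values on every attribute in `attrs` fall into the same block."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def conditional_entropy(rows, cond_attrs, decision):
    """Shannon conditional entropy H(decision | cond_attrs) in bits.
    Assumption: the standard definition, not necessarily the paper's."""
    n = len(rows)
    h = 0.0
    for block in partition(rows, cond_attrs):
        weight = len(block) / n  # P(X_i)
        counts = Counter(rows[i][decision] for i in block)
        for c in counts.values():
            p = c / len(block)  # P(Y_j | X_i)
            h -= weight * p * log2(p)
    return h


# Hypothetical toy decision table: condition attributes a, b; decision d.
table = [
    {"a": 0, "b": 0, "d": "yes"},
    {"a": 0, "b": 0, "d": "no"},
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 1, "d": "yes"},
]

h_a = conditional_entropy(table, ["a"], "d")
h_none = conditional_entropy(table, [], "d")
```

In this toy table, the block with a = 0 mixes both decision values while the block with a = 1 is pure, so only the first block contributes uncertainty. A supervised method can exploit exactly this kind of disagreement between condition and decision attributes, which is discarded when the decision attribute is ignored.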

Outlier Mining of the High-dimension Datasets Based on Information Theory

Outlier detection using conditional information entropy and rough set theory

Mining Query-Based Subnetwork Outliers in Heterogeneous Information Networks

Purification, characterization and molecular cloning of tyrosinase from the cephalopod mollusk, Illex argentinus.

Detecting Outliers in High Dimensional Data Sets using Z-Score Methodology

Robust Subspace Outlier Detection in High Dimensional Space

A new unsupervised outlier detection method

Outlier detection for incomplete real-valued data via information entropy and class-consistent technology

An Outlier Detection Method in High Dimensional Time Series Based on LLM

An Information-Theoretic Approach to Unsupervised Feature Selection for High-Dimensional Data

Semi-supervised Hierarchical Clustering Analysis for High Dimensional Data

Markov Boundary-Based Outlier Mining

Distance-based outlier detection for high dimension, low sample size data

Outlier detection method based on high-density iteration

A Fast Outlier Detection Method for Big Data

Outlier Analysis for Gene Expression Data

A rough set based clustering algorithm and the information theoretical approach to refine clusters

Multivalued Subsets Under Information Theory

Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis

Info-Detection: An Information-Theoretic Approach To Detect Outlier

Research Progress on Outlier Mining