Abstract: Unsupervised Outlier Detection (UOD) is an important data mining task. With the advance of deep learning, deep Outlier Detection (OD) has received broad interest. Most deep UOD models are trained exclusively on clean datasets to learn the distribution of the normal data, which requires substantial manual effort to clean real-world data, if cleaning is possible at all. Instead of relying on clean datasets, some approaches train and detect directly on unlabeled contaminated datasets, which calls for methods that are robust to contamination. Ensemble methods have emerged as a superior solution for improving model robustness against contaminated training sets, but they greatly increase training time. In this study, we investigate the impact of outliers on the training phase, aiming to halt training on unlabeled contaminated datasets before performance degrades. We first observe that mixing normal and anomalous data causes fluctuations in AUC, a label-dependent measure of detection accuracy. To avoid the need for labels, we propose Loss Entropy, a zero-label entropy metric computed on the loss distribution, which lets us infer optimal stopping points for training without labels. We also demonstrate theoretically a negative correlation between this entropy metric and the label-based AUC. Based on this, we develop an automated early-stopping algorithm, EntropyStop, which halts training when the loss entropy suggests that the model's detection capability is at its maximum. We conduct extensive experiments on ADBench (including 47 real datasets), and the overall results indicate that an AutoEncoder (AE) enhanced by our approach not only achieves better performance than ensemble AEs but also requires under 2% of their training time. Lastly, we evaluate the proposed metric and early-stopping approach on other deep OD models, showing their broad potential applicability.
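
The idea of an entropy over the loss distribution can be illustrated with a minimal sketch. It assumes that per-sample losses are non-negative and are normalized to sum to 1 before taking Shannon entropy. The function names `loss_entropy` and `pick_stopping_point`, and the patience-based stopping rule, are hypothetical choices made for illustration and are not taken from the abstract; the actual EntropyStop criterion may differ.

```python
import math


def loss_entropy(losses):
    """Shannon entropy of per-sample losses normalized to sum to 1.

    Assumption: losses are non-negative (e.g. reconstruction errors).
    """
    total = sum(losses)
    if total <= 0:
        return 0.0
    entropy = 0.0
    for loss in losses:
        p = loss / total
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return entropy


def pick_stopping_point(entropies, patience=3):
    """Index of the lowest entropy seen before `patience` evaluations
    pass without a new minimum (a simple, assumed stopping rule)."""
    best_idx, best = 0, entropies[0]
    for i, h in enumerate(entropies[1:], start=1):
        if h < best:
            best_idx, best = i, h
        elif i - best_idx >= patience:
            break
    return best_idx


# Losses concentrated on a few samples give lower entropy than uniform losses.
uniform = loss_entropy([1.0, 1.0, 1.0, 1.0])
concentrated = loss_entropy([0.1, 0.1, 0.1, 3.7])

# Entropy recorded at successive evaluations during training (made-up values).
history = [1.0, 0.8, 0.6, 0.7, 0.9, 0.95, 0.5]
stop_at = pick_stopping_point(history, patience=3)
```

Because the abstract reports a negative correlation between the entropy metric and AUC, the sketch treats the lowest observed entropy as the candidate stopping point. The `patience` parameter trades extra training iterations against the risk of stopping at a local minimum.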

Find Important Training Dataset by Observing the Training Sequence Similarity

Supplementary Material: Quasi-Dense Similarity Learning for Multiple Object Tracking

Finding Key Training Data by Calculating Influence Score.

Deep Learning on a Data Diet: Finding Important Examples Early in Training

Do We Train on Test Data? Purging CIFAR of Near-Duplicates

Learn to Forget: Memorization Elimination for Neural Networks.

Data Deletion for Linear Regression with Noisy SGD

Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models

Information FOMO: The Unhealthy Fear of Missing Out on Information—A Method for Removing Misleading Data for Healthier Models

A Ranking-Based Cross-Entropy Loss for Early Classification of Time Series.

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

An Efficient Strategy for Catastrophic Forgetting Reduction in Incremental Learning

Selecting Distinctive-Variant Training Samples Base on Intra-class Similarity

Measuring Forgetting of Memorized Training Examples

Learn to Forget: Machine Unlearning Via Neuron Masking

Fine-tuning can Help Detect Pretraining Data from Large Language Models

Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning

A Two-step Information Accumulation Strategy for Learning from Highly Imbalanced Data.

Importance estimate of features via analysis of their weight and gradient profile

EntropyStop: Unsupervised Deep Outlier Detection with Loss Entropy