Abstract: Minority oversampling is currently one of the most popular and effective methods for handling imbalanced data. However, oversampling that relies on observations of the minority class to generate new samples is not applicable to imbalanced data with extremely scarce minority samples, because the strongly underrepresented minority class does not contain enough information to support the oversampling process. Although some recent studies have shown the effectiveness of using majority information to bootstrap oversampling, neglecting class overlap in the sampling process increases the degree of overlap and complicates the decision boundary. To this end, this paper proposes Mahalanobis distance and Local information based OverSampling (MLOS) for highly imbalanced class-overlapped data. MLOS first employs the majority density to guide sample synthesis, using the Mahalanobis distance to extract the majority probability contour. Then, for each minority seed sample, to avoid generating overlapping samples, MLOS constrains the synthesis process by finding the auxiliary sample (among its 5 nearest neighbors) whose probability density value is similar to that of the seed. Finally, MLOS uses a pair-wise data cleaning process to improve the visibility of the decision boundary according to the probability density of the synthetic samples. Comparative experiments conducted on 16 highly imbalanced class-overlapped datasets, involving 17 different methods, demonstrate the superiority of the proposed method in terms of three popular evaluation metrics for imbalanced classification: AUC, G-mean, and Recall. The source code of MLOS is available at https://github.com/ytyancp/MLOS.
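The first two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it only shows the general idea under several assumptions: the majority density is summarized by a single mean and covariance, the Mahalanobis distance to that distribution stands in for the "probability density value", the 5 nearest neighbors are taken among minority samples using Euclidean distance, and new points are produced by SMOTE-style linear interpolation. The function names `mahalanobis_to_majority` and `synthesize` are hypothetical, and the pair-wise cleaning step is omitted.

```python
import numpy as np


def mahalanobis_to_majority(X, majority):
    """Mahalanobis distance of each row of X to the majority-class distribution."""
    mu = majority.mean(axis=0)
    cov = np.atleast_2d(np.cov(majority, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))


def synthesize(seed_idx, minority, majority, k=5, rng=None):
    """Create one synthetic sample between a seed and its auxiliary sample.

    Assumption: the auxiliary sample is the one among the seed's k nearest
    minority neighbors whose Mahalanobis distance to the majority class is
    closest to the seed's own, i.e. it lies on a similar majority contour.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = np.random.default_rng() if rng is None else rng
    seed = minority[seed_idx]

    # k nearest minority neighbors of the seed (Euclidean), excluding the seed.
    dists = np.linalg.norm(minority - seed, axis=1)
    dists[seed_idx] = np.inf
    neighbours = np.argsort(dists)[: min(k, len(minority) - 1)]

    # Pick the neighbor on the most similar majority contour.
    md = mahalanobis_to_majority(minority, majority)
    aux_idx = neighbours[np.argmin(np.abs(md[neighbours] - md[seed_idx]))]

    # SMOTE-style interpolation between seed and auxiliary sample (assumption).
    gap = rng.random()
    return seed + gap * (minority[aux_idx] - seed), int(aux_idx)


# Usage on toy data: a large majority cloud and a handful of minority points.
rng = np.random.default_rng(0)
majority = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
minority = rng.normal(loc=2.5, scale=0.5, size=(8, 2))
new_point, aux_idx = synthesize(0, minority, majority, k=5, rng=rng)
```

Interpolating only toward a neighbor on a similar contour keeps the synthetic point at roughly the same "depth" relative to the majority class as the seed, which is the intuition behind avoiding new overlapping samples.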

Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data

Multi-label sampling based on local label imbalance

Integrating Unsupervised Clustering and Label-specific Oversampling to Tackle Imbalanced Multi-label Data

A Bayesian Network nearest k-labels method for Multi-label classification

Label correlation guided borderline oversampling for imbalanced multi-label data learning

Nearest neighbors and density-based undersampling for imbalanced data classification with class overlap

Addressing the multi-label imbalance for neural networks: An approach based on stratified mini-batches

Natural local density-based adaptive oversampling algorithm for imbalanced classification

Under-bagging Nearest Neighbors for Imbalanced Classification

PLM: Partial Label Masking for Imbalanced Multi-label Classification

Towards Deeper Insights into Deep Learning from Imbalanced Data

Synthetic oversampling with Mahalanobis distance and local information for highly imbalanced class-overlapped data

Towards Imbalanced Large Scale Multi-label Classification with Partially Annotated Labels

Trainable Undersampling for Class-Imbalance Learning

Best First Over-Sampling for Multilabel Classification

Iterative Nearest Neighborhood Oversampling in Semisupervised Learning from Imbalanced Data

A Similarity-Based Oversampling Method for Multi-label Imbalanced Text Data

AEMLO: AutoEncoder-Guided Multi-Label Oversampling

A Density-based Under-sampling Algorithm for Imbalance Classification

Learning With Noisy Labels Over Imbalanced Subpopulations

A hybrid sampling method for highly imbalanced and overlapped data classification with complex distribution