Abstract: Imbalanced data are present in various business sectors and must be handled with the proper resampling methods and classification algorithms. To handle imbalanced data, there are numerous resampling and learning method combinations; nonetheless, their effective use necessitates specialised knowledge. In this paper, several approaches, ranging from more accessible to more advanced in the domain of data resampling techniques, will be considered to handle imbalanced data. The application developed delivers recommendations of the most suitable combinations of techniques for a specific dataset by extracting and comparing dataset meta-feature values recorded in a knowledge base. It facilitates effortless classification and automates part of the machine learning pipeline with comparable or better results than state-of-the-art solutions and with a much smaller execution time.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address the issue of imbalanced data classification in machine learning. Imbalanced data is prevalent in various business domains such as telecommunications, bioinformatics, fraud detection, and medical diagnostics. A primary characteristic of imbalanced datasets is that the number of samples in one class is significantly lower than in other classes, posing challenges to traditional classification algorithms. To tackle this problem, the paper proposes an automated approach to handle imbalanced data by combining different resampling techniques and classification algorithms.

#### Main Objectives

- Develop a system that automatically prepares imbalanced datasets for classifier use.
- Record the best combinations of resampling techniques, classification algorithms, and dataset meta-features by evaluating various combinations.
- Provide a recommendation mechanism that suggests the most suitable combination of resampling and classification techniques based on the characteristics of a new dataset.

#### Methodology

- **Dataset Selection**: 65 imbalanced datasets from various domains were selected from public data sources such as the UCI Machine Learning Repository, KEEL, and OpenML.
- **Meta-feature Extraction**: Meta-features were extracted from the original datasets to analyze the complexity, concepts, and statistical properties of the datasets.
- **Evaluation Metrics**: Multiple performance metrics suitable for imbalanced datasets were used, including Balanced Accuracy, F1 score, ROC AUC, Geometric Mean, and Cohen's Kappa coefficient.
- **Resampling and Classification Algorithms**: 19 resampling algorithms (including oversampling, undersampling, and hybrid sampling) and various classification algorithms were tested to find the best combinations.
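To make the evaluation metrics listed above concrete, the following is a minimal sketch of how four of them can be computed from the confusion matrix of a binary classifier. It is an illustration under stated assumptions, not the paper's code: the function names `confusion_counts` and `imbalance_metrics` are hypothetical, the positive (minority) class is assumed to be labelled `1`, and ROC AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Hypothetical helper: metrics that stay informative under class imbalance."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    n = tp + fp + tn + fn

    recall = tp / (tp + fn) if (tp + fn) else 0.0       # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / n

    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - chance) / (1 - chance) if chance != 1 else 0.0

    return {
        "accuracy": accuracy,
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": f1,
        "geometric_mean": math.sqrt(recall * specificity),
        "cohen_kappa": kappa,
    }
```

On a 90/10 split, a classifier that always predicts the majority class reaches an accuracy of 0.9 with these definitions, yet its balanced accuracy is 0.5 and its F1 score, geometric mean, and kappa are all 0, which is why plain accuracy is a poor guide on imbalanced data.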
Through this approach, the paper aims to develop an automated tool to help users, especially those with less experience, more easily handle imbalanced datasets and improve the performance of classification tasks.
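The recommendation step can be pictured as a nearest-neighbour lookup over stored meta-features. The sketch below is only one plausible reading of "comparing dataset meta-feature values recorded in a knowledge base": the summary does not say how the paper measures similarity, so the min-max scaling, the Euclidean distance, the three meta-feature names, and every entry in `KNOWLEDGE_BASE` are assumptions made for illustration and are not results from the paper.

```python
import math

# Hypothetical knowledge base: meta-features of previously evaluated datasets
# and the combination that scored best on each. All values are illustrative.
KNOWLEDGE_BASE = [
    {"meta": {"imbalance_ratio": 9.0, "n_features": 10, "n_instances": 1000},
     "resampler": "SMOTE", "classifier": "RandomForest"},
    {"meta": {"imbalance_ratio": 40.0, "n_features": 30, "n_instances": 5000},
     "resampler": "RandomUnderSampler", "classifier": "LogisticRegression"},
    {"meta": {"imbalance_ratio": 2.5, "n_features": 5, "n_instances": 300},
     "resampler": "none", "classifier": "DecisionTree"},
]


def _ranges(entries, keys):
    """Min and max of each meta-feature across the knowledge base."""
    result = {}
    for key in keys:
        values = [entry["meta"][key] for entry in entries]
        result[key] = (min(values), max(values))
    return result


def _scaled(value, low, high):
    """Min-max scale so that no single meta-feature dominates the distance."""
    return 0.0 if high == low else (value - low) / (high - low)


def recommend(new_meta, knowledge_base=KNOWLEDGE_BASE):
    """Return the (resampler, classifier) pair of the most similar stored dataset."""
    keys = sorted(new_meta)
    ranges = _ranges(knowledge_base, keys)

    def distance(entry):
        return math.sqrt(sum(
            (_scaled(new_meta[k], *ranges[k])
             - _scaled(entry["meta"][k], *ranges[k])) ** 2
            for k in keys
        ))

    best = min(knowledge_base, key=distance)
    return best["resampler"], best["classifier"]
```

Calling `recommend({"imbalance_ratio": 10.0, "n_features": 12, "n_instances": 900})` picks the first entry here, because its scaled meta-features lie closest to the query. A lookup of this kind avoids re-running every resampler and classifier pair on a new dataset, which is consistent with the shorter execution time the abstract claims.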

An automated approach for binary classification on imbalanced data

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

Selecting the Suitable Resampling Strategy for Imbalanced Data Classification Regarding Dataset Properties. An Approach Based on Association Models

Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties

Imbalanced Data Classification Algorithm Based on Integrated Sampling and Ensemble Learning.

Handling Imbalanced Data: A Case Study for Binary Class Problems

Resampling approach for imbalanced data classification based on class instance density per feature value intervals

A Study of Data Pre-processing Techniques for Imbalanced Biomedical Data Classification

A Classification Method For Imbalance Data Set Based on Kernel SMOTE

Addressing Binary Classification over Class Imbalanced Clinical Datasets Using Computationally Intelligent Techniques

Value-Aware Resampling and Loss for Imbalanced Classification

A Bilevel Optimization Framework for Imbalanced Data Classification

An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance Classification

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Resampling strategies for imbalanced regression: a survey and empirical analysis

Efficient hybrid oversampling and intelligent undersampling for imbalanced big data classification

An empirical evaluation of sampling methods for the classification of imbalanced data

The imbalance problem: A comparison of sampling approaches using different parameters and feature selection methods in the context of classification

Hybrid SVM algorithm oriented to classifying imbalanced datasets

Oversampling for Imbalanced Learning Based on K-Means and SMOTE

A cluster impurity-based hybrid resampling for imbalanced classification problems