An automated approach for binary classification on imbalanced data

Pedro Marques Vieira, Fátima Rodrigues
DOI: https://doi.org/10.1007/s10115-023-02046-7
IF: 2.7
2024-01-13
Knowledge and Information Systems
Abstract: Imbalanced data are present in various business sectors and must be handled with the proper resampling methods and classification algorithms. To handle imbalanced data, there are numerous resampling and learning method combinations; nonetheless, their effective use necessitates specialised knowledge. In this paper, several approaches, ranging from more accessible to more advanced in the domain of data resampling techniques, will be considered to handle imbalanced data. The application developed delivers recommendations of the most suitable combinations of techniques for a specific dataset by extracting and comparing dataset meta-feature values recorded in a knowledge base. It facilitates effortless classification and automates part of the machine learning pipeline with comparable or better results than state-of-the-art solutions and with a much smaller execution time.
computer science, information systems, artificial intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper aims to address the issue of imbalanced data classification in machine learning. Imbalanced data is prevalent in various business domains such as telecommunications, bioinformatics, fraud detection, and medical diagnostics. A primary characteristic of imbalanced datasets is that the number of samples in one class is significantly lower than in other classes, posing challenges to traditional classification algorithms. To tackle this problem, the paper proposes an automated approach to handle imbalanced data by combining different resampling techniques and classification algorithms.

#### Main Objectives

- Develop a system that automatically prepares imbalanced datasets for classifier use.
- Record the best combinations of resampling techniques, classification algorithms, and dataset meta-features by evaluating various combinations.
- Provide a recommendation mechanism that suggests the most suitable combination of resampling and classification techniques based on the characteristics of a new dataset.

#### Methodology

- **Dataset Selection**: 65 imbalanced datasets from various domains were selected from public data sources such as the UCI Machine Learning Repository, KEEL, and OpenML.
- **Meta-feature Extraction**: Meta-features were extracted from the original datasets to analyze the complexity, concepts, and statistical properties of the datasets.
- **Evaluation Metrics**: Multiple performance metrics suitable for imbalanced datasets were used, including Balanced Accuracy, F1 score, ROC AUC, Geometric Mean, and Cohen's Kappa coefficient.
- **Resampling and Classification Algorithms**: 19 resampling algorithms (including oversampling, undersampling, and hybrid sampling) and various classification algorithms were tested to find the best combinations.
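
To make the evaluation metrics listed above concrete, the following is a minimal sketch of how four of them can be derived from a binary confusion matrix. It is not the paper's implementation: the function names `confusion_counts` and `imbalance_metrics` and the toy labels are assumptions made for illustration, and ROC AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred == positive:
            fp += 1
        elif truth != positive and pred != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Balanced accuracy, F1, geometric mean and Cohen's kappa."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    n = tp + fp + tn + fn

    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa compares observed agreement with chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": f1,
        "geometric_mean": math.sqrt(recall * specificity),
        "cohen_kappa": kappa,
    }


# Toy example: 2 minority (positive) samples among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = imbalance_metrics(y_true, y_pred)
print(metrics)
```

In this toy case plain accuracy is 0.8 although only half of the minority class is recovered, which is why metrics that weight both classes are preferred on imbalanced data.
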
Through this approach, the paper aims to develop an automated tool to help users, especially those with less experience, more easily handle imbalanced datasets and improve the performance of classification tasks.
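
The recommendation idea described above (compare a new dataset's meta-features with those stored in a knowledge base) can be sketched as a nearest-neighbour lookup. This is an illustrative sketch only. The three meta-features, the range-scaled Euclidean distance, the function names, and every knowledge-base entry below are assumptions and placeholders, not the paper's actual meta-feature set, similarity measure, or results.

```python
import math


def extract_meta_features(X, y):
    """Tiny, illustrative meta-feature set (assumed, not the paper's list)."""
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    return {
        "log_n_instances": math.log10(len(y)),
        "n_features": float(len(X[0])),
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
    }


def recommend(query_meta, knowledge_base, k=1):
    """Return the k (resampler, classifier) pairs whose recorded
    meta-features are closest to the query dataset's meta-features."""
    keys = sorted(query_meta)

    # Scale each meta-feature by its observed range so none dominates.
    spans = {}
    for key in keys:
        values = [e["meta"][key] for e in knowledge_base] + [query_meta[key]]
        spans[key] = (max(values) - min(values)) or 1.0

    def distance(entry):
        return math.sqrt(sum(
            ((entry["meta"][key] - query_meta[key]) / spans[key]) ** 2
            for key in keys
        ))

    ranked = sorted(knowledge_base, key=distance)
    return [(e["resampler"], e["classifier"]) for e in ranked[:k]]


# Placeholder knowledge base: names and numbers are made up for the demo.
KNOWLEDGE_BASE = [
    {"meta": {"log_n_instances": 2.0, "n_features": 5.0, "imbalance_ratio": 3.0},
     "resampler": "oversampler_A", "classifier": "classifier_X"},
    {"meta": {"log_n_instances": 4.0, "n_features": 30.0, "imbalance_ratio": 50.0},
     "resampler": "undersampler_B", "classifier": "classifier_Y"},
    {"meta": {"log_n_instances": 3.0, "n_features": 12.0, "imbalance_ratio": 10.0},
     "resampler": "hybrid_C", "classifier": "classifier_Z"},
]

# New dataset: 100 rows, 5 features, 75 majority vs 25 minority samples.
X_new = [[0.0] * 5 for _ in range(100)]
y_new = [0] * 75 + [1] * 25
meta_new = extract_meta_features(X_new, y_new)
print(recommend(meta_new, KNOWLEDGE_BASE, k=2))
```

A lookup of this kind avoids re-running every resampler and classifier combination on the new dataset, which is consistent with the much smaller execution time claimed in the abstract.
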