Mohammed Temraz, Mark T. Keane
Abstract: Learning from class imbalanced datasets poses challenges for many machine learning algorithms. Many real-world domains are, by definition, class imbalanced by virtue of having a majority class that naturally has many more instances than its minority class (e.g. genuine bank transactions occur much more often than fraudulent ones). Many methods have been proposed to solve the class imbalance problem, among the most popular being oversampling techniques (such as SMOTE). These methods generate synthetic instances in the minority class, to balance the dataset, performing data augmentations that improve the performance of predictive machine learning (ML) models. In this paper we advance a novel data augmentation method (adapted from eXplainable AI), that generates synthetic, counterfactual instances in the minority class. Unlike other oversampling techniques, this method adaptively combines existing instances from the dataset, using actual feature-values rather than interpolating values between instances. Several experiments using four different classifiers and 25 datasets are reported, which show that this Counterfactual Augmentation method (CFA) generates useful synthetic data points in the minority class. The experiments also show that CFA is competitive with many other oversampling methods, many of which are variants of SMOTE. The basis for CFA's performance is discussed, along with the conditions under which it is likely to perform better or worse in future tests.
What problem does this paper attempt to address?
This paper addresses **classification on class-imbalanced datasets**: when minority-class samples are far fewer than majority-class samples, how can synthetic minority-class samples be generated to improve the performance of machine-learning models? The authors propose a data-augmentation technique based on counterfactuals, called **Counterfactual Augmentation (CFA)**, to address this challenge.
### Problem Background
In many real-world applications, such as credit-card fraud detection, medical diagnosis, and text classification, datasets are class imbalanced. For example, in credit-card transactions, normal transactions far outnumber fraudulent ones. This imbalance can cause machine-learning models to predict the minority class poorly while still reporting a misleadingly high accuracy, because that accuracy comes mostly from correct predictions on the majority class. It can also impair rule induction in models such as decision trees.
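The misleading-accuracy effect can be checked with a little arithmetic. The sketch below uses hypothetical numbers (990 genuine and 10 fraudulent transactions, not taken from the paper) and a trivial model that always predicts the majority class; the function name `accuracy_and_minority_recall` is an illustrative helper, not part of any library.

```python
def accuracy_and_minority_recall(y_true, y_pred, minority_label=1):
    """Return overall accuracy and recall on the minority class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    minority_total = sum(t == minority_label for t in y_true)
    minority_hits = sum(
        t == minority_label and p == minority_label
        for t, p in zip(y_true, y_pred)
    )
    recall = minority_hits / minority_total if minority_total else 0.0
    return accuracy, recall


# Hypothetical data: 990 genuine (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10
# A degenerate model that labels every transaction as genuine.
y_pred = [0] * 1000

accuracy, recall = accuracy_and_minority_recall(y_true, y_pred)
print(accuracy, recall)  # 0.99 0.0
```

The model scores 99% accuracy yet detects none of the fraudulent cases, which is why accuracy alone is a poor guide on imbalanced data.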
### Solution
To solve this problem, existing methods include:
1. **Random Oversampling (ROS)**: Simply replicate minority-class samples.
2. **Random Undersampling (RUS)**: Randomly delete majority-class samples.
3. **Synthetic Minority Over-sampling Technique (SMOTE)**: Generate new minority-class samples through interpolation.
Each of these methods has drawbacks, however. ROS may lead to overfitting, RUS may discard important information, and SMOTE, although it avoids simple replication, may introduce noise or generate samples that fall outside the minority-class distribution.
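The three baseline strategies listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the data points are hypothetical, and `smote_like_sample` is a simplification that interpolates toward the single nearest minority neighbour, whereas SMOTE proper picks one of the k nearest neighbours at random.

```python
import random


def random_oversample(minority, target_size, rng):
    """ROS: duplicate randomly chosen minority instances up to target_size."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def random_undersample(majority, target_size, rng):
    """RUS: keep a random subset of the majority instances."""
    return rng.sample(majority, target_size)


def smote_like_sample(minority, rng):
    """Simplified SMOTE-style step: interpolate between a minority
    instance and its nearest minority neighbour (k = 1 for brevity)."""
    base = rng.choice(minority)
    others = [m for m in minority if m is not base]
    neighbour = min(
        others, key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m))
    )
    gap = rng.random()  # random position along the connecting segment
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))


# Hypothetical two-feature data.
rng = random.Random(0)
minority = [(1.0, 2.0), (1.5, 2.5), (3.0, 1.0)]
majority = [(float(i), float(i % 5)) for i in range(20)]

oversampled = random_oversample(minority, len(majority), rng)
undersampled = random_undersample(majority, len(minority), rng)
synthetic = smote_like_sample(minority, rng)
print(len(oversampled), len(undersampled), synthetic)
```

Note that ROS and RUS only ever return existing instances, while the interpolated point will generally carry feature values that appear nowhere in the original data.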
### Innovations Proposed in the Paper
The authors propose a new method based on **counterfactual reasoning**: **Counterfactual Augmentation (CFA)**. The core idea is to generate new minority-class samples from actual feature values rather than by interpolation. Specifically, CFA works through the following steps:
1. **Finding Counterfactual Pairs**: Find pairs of instances in the dataset that match on most features but differ on a few key features, and that belong to different classes. For example, two loan applicants may be identical in every respect except income, yet one is approved and the other is rejected.
2. **Generating Synthetic Samples**: Based on these counterfactual pairs, generate new minority-class samples. The feature values of the new samples are taken from the actual values of existing instances rather than obtained through interpolation.
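One plausible reading of the two steps above can be sketched as follows. This is not the authors' implementation: the matching rule (exact equality within `tolerance`), the `max_diffs` threshold, the choice of the nearest paired instance by squared Euclidean distance, and all function names and loan data are assumptions made for illustration.

```python
def differing_features(a, b, tolerance=0.0):
    """Indices of the features on which two instances differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if abs(x - y) > tolerance]


def find_counterfactual_pairs(majority, minority, max_diffs=1):
    """Step 1: pair majority and minority instances that differ
    on at most max_diffs features (assumed threshold)."""
    pairs = []
    for maj in majority:
        for mino in minority:
            diffs = differing_features(maj, mino)
            if 0 < len(diffs) <= max_diffs:
                pairs.append((maj, mino, diffs))
    return pairs


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def augment(majority, minority, max_diffs=1):
    """Step 2: for each unpaired majority instance, borrow the difference
    feature values from the minority side of the nearest existing pair."""
    pairs = find_counterfactual_pairs(majority, minority, max_diffs)
    if not pairs:
        return []
    paired_majority = {maj for maj, _, _ in pairs}
    synthetic = []
    for query in majority:
        if query in paired_majority:
            continue
        _, mino, diffs = min(pairs, key=lambda p: squared_distance(query, p[0]))
        candidate = list(query)
        for i in diffs:
            candidate[i] = mino[i]  # actual value copied, never interpolated
        synthetic.append(tuple(candidate))
    return synthetic


# Hypothetical loan data: (income in thousands, credit years, existing loans).
approved = [(60, 10, 1), (58, 8, 2), (75, 12, 0)]  # majority class
rejected = [(30, 10, 1)]                           # minority class

new_rejected = augment(approved, rejected)
print(new_rejected)  # [(30, 8, 2), (30, 12, 0)]
```

Every value in the synthetic instances already occurs somewhere in the dataset, which is the property that distinguishes this approach from interpolation. The sketch does not check that a candidate would really belong to the minority class; a fuller version would need some such validation.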
### Experimental Results
The paper evaluates CFA in several experiments using four different classifiers and 25 datasets. The results show that the synthetic samples generated by CFA are useful for improving prediction of the minority class, and that CFA is competitive with many other oversampling methods, many of which are variants of SMOTE.
### Conclusion
CFA offers a novel approach to the class-imbalance problem. By generating plausible counterfactual samples from actual feature values, it improves model performance and may sidestep problems associated with traditional methods, such as overfitting or the introduction of noise. The paper also discusses the basis for CFA's performance and the conditions under which it is likely to perform better or worse. Future research can explore CFA in more domains and on more complex datasets.