A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

Gabriel Aguiar, Bartosz Krawczyk, Alberto Cano
DOI: https://doi.org/10.1007/s10994-023-06353-6
2023-07-18
Abstract: Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures and benchmarks on how to evaluate these algorithms. This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to a large-scale experimental study comparing state-of-the-art classifiers in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental framework is fully reproducible and easy to extend with new methods. This way, we propose a standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create a complete, trustworthy, and fair evaluation of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.
Machine Learning
What problem does this paper attempt to address?
The paper aims to address the issue of class imbalance in data streams and proposes a standardized, comprehensive, and reproducible experimental framework to evaluate the performance of different algorithms in handling imbalanced data streams.

### Problems the paper attempts to solve:

1. **Standardized Evaluation Framework**: There is currently a lack of a unified standard and benchmark to comprehensively evaluate algorithms for handling imbalanced data streams. This paper proposes a standardized, detailed, and reproducible experimental framework for evaluating the latest algorithms in binary and multi-class imbalanced data streams.
2. **Class Imbalance Challenges**: As data streams continuously change (i.e., concept drift), the problem of class imbalance becomes more complex. The paper explores how to address changes in class proportions and other instance-level difficulties in dynamic environments.
3. **Large-Scale Experimental Study**: By comparing the performance of 24 state-of-the-art data stream algorithms on 515 imbalanced data streams, the paper conducts a large-scale experimental study covering static and dynamic class imbalance ratios, instance-level difficulty, concept drift, and real-world and semi-synthetic datasets.
4. **Recommendations and Future Directions**: Based on the experimental results, the paper provides recommendations for end-users to help select the best algorithms and outlines open challenges and future research directions in the field.

Through these efforts, the paper aims to provide researchers with a unified evaluation standard, enabling new algorithms to be compared on a fair, transparent, and reproducible basis.
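
To make the idea of a dynamic class imbalance ratio concrete, here is a minimal Python sketch that tracks the majority-to-minority count ratio over a sliding window of a label stream. It is an illustration only and is not taken from the paper's framework. The function name `windowed_imbalance_ratio`, the window size, and the toy stream are all assumptions chosen for the example.

```python
from collections import Counter, deque


def windowed_imbalance_ratio(labels, window_size=100):
    """Yield the majority/minority count ratio over a sliding window of labels.

    Only classes currently present in the window are counted, so the
    ratio is 1.0 while a single class has been observed.
    (Hypothetical helper, not part of the paper's framework.)
    """
    window = deque()
    counts = Counter()
    for label in labels:
        window.append(label)
        counts[label] += 1
        if len(window) > window_size:
            # Forget the oldest label so the ratio reflects recent data only.
            old = window.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        yield max(counts.values()) / min(counts.values())


# Toy stream (assumed): class 1 is the minority at first (10% of instances),
# then the class proportions shift and class 1 becomes the majority (80%).
stream = [1 if i % 10 == 0 else 0 for i in range(200)]
stream += [0 if i % 5 == 0 else 1 for i in range(200)]

ratios = list(windowed_imbalance_ratio(stream, window_size=100))
print(ratios[199])  # 9.0: 90 instances of class 0 vs 10 of class 1
print(ratios[399])  # 4.0: 80 instances of class 1 vs 20 of class 0
```

In this toy stream the ratio changes and the minority class switches from class 1 to class 0, which is the kind of shift in class proportions that a static, whole-dataset imbalance ratio would hide.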