A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

Gabriel Aguiar, Bartosz Krawczyk, Alberto Cano

DOI: https://doi.org/10.1007/s10994-023-06353-6

2023-07-18

Abstract: Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures and benchmarks on how to evaluate these algorithms. This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to a large-scale experimental study comparing state-of-the-art classifiers in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental framework is fully reproducible and easy to extend with new methods. This way, we propose a standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create complete, trustworthy, and fair evaluation of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.

Machine Learning

What problem does this paper attempt to address?

The paper aims to address the issue of class imbalance in data streams and proposes a standardized, comprehensive, and reproducible experimental framework to evaluate the performance of different algorithms in handling imbalanced data streams.

### Problems the paper attempts to solve:

1. **Standardized Evaluation Framework**: There is currently a lack of a unified standard and benchmark to comprehensively evaluate algorithms for handling imbalanced data streams. This paper proposes a standardized, detailed, and reproducible experimental framework for evaluating the latest algorithms in binary and multi-class imbalanced data streams.
2. **Class Imbalance Challenges**: As data streams continuously change (i.e., concept drift), the problem of class imbalance becomes more complex. The paper explores how to address changes in class proportions and other instance-level difficulties in dynamic environments.
3. **Large-Scale Experimental Study**: By comparing the performance of 24 state-of-the-art data stream algorithms on 515 imbalanced data streams, the paper conducts a large-scale experimental study covering static and dynamic class imbalance ratios, instance-level difficulty, concept drift, and real-world and semi-synthetic datasets.
4. **Recommendations and Future Directions**: Based on the experimental results, the paper provides recommendations for end-users to help select the best algorithms and outlines open challenges and future research directions in the field.

Through these efforts, the paper aims to provide researchers with a unified evaluation standard, enabling new algorithms to be compared on a fair, transparent, and reproducible basis.
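To make concrete why imbalanced streams need dedicated evaluation, here is a minimal sketch in plain Python. It is not the authors' framework. Every name in it (`make_stream`, `minority_prior`, `MajorityClassifier`, `prequential`), the toy data generator, the window size, and the choice of a test-then-train loop with a G-mean metric are assumptions made for illustration only.

```python
import math
import random
from collections import deque


def minority_prior(t, n, start=0.05, end=0.5):
    """Hypothetical schedule: the minority prior drifts linearly from start to end."""
    return start + (end - start) * t / max(n - 1, 1)


def make_stream(n, seed=0, dynamic=True):
    """Yield (x, y) pairs of toy data; y=1 is the minority class (assumed setup)."""
    rng = random.Random(seed)
    for t in range(n):
        # Dynamic imbalance follows the schedule; static imbalance stays at 5%.
        p = minority_prior(t, n) if dynamic else 0.05
        y = 1 if rng.random() < p else 0
        x = rng.gauss(1.0 if y else -1.0, 1.0)
        yield x, y


class MajorityClassifier:
    """Baseline that always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = [0, 0]

    def predict(self, x):
        return 0 if self.counts[0] >= self.counts[1] else 1

    def learn(self, x, y):
        self.counts[y] += 1


def g_mean(confusion):
    """Geometric mean of per-class recall; confusion is indexed [true][predicted]."""
    recalls = []
    for label, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            continue  # skip classes absent from the window
        recalls.append(row[label] / total)
    if not recalls:
        return 0.0
    return math.prod(recalls) ** (1.0 / len(recalls))


def prequential(stream, model, window=500):
    """Test-then-train loop; metrics are computed over the last `window` instances."""
    recent = deque(maxlen=window)
    for x, y in stream:
        recent.append((y, model.predict(x)))  # predict before the label is used
        model.learn(x, y)
    confusion = [[0, 0], [0, 0]]
    for y_true, y_pred in recent:
        confusion[y_true][y_pred] += 1
    accuracy = (confusion[0][0] + confusion[1][1]) / len(recent)
    return accuracy, g_mean(confusion)


if __name__ == "__main__":
    for dynamic in (False, True):
        acc, gm = prequential(make_stream(2000, dynamic=dynamic), MajorityClassifier())
        print(f"dynamic={dynamic}: accuracy={acc:.3f}, g_mean={gm:.3f}")
```

On the static stream in this sketch, the majority-class baseline should score high accuracy while its G-mean collapses to zero, because it never predicts the minority class. This is the kind of effect that motivates imbalance-aware metrics and agreed-upon benchmarks.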

Online Fairness-Aware Learning with Imbalanced Data Streams

Challenges in Benchmarking Stream Learning Algorithms with Real-world Data

A Framework of Online Learning with Imbalanced Streaming Data

A comprehensive ensemble classification techniques detecting and managing concept drift in dynamic imbalanced data streams

stream-learn -- open-source Python library for difficult data stream batch analysis

A comprehensive active learning method for multiclass imbalanced data streams with concept drift

Intensive Class Imbalance Learning in Drifting Data Streams

Learning from Data Streams: An Overview and Update

Imbalanced Data Stream Classification using Dynamic Ensemble Selection

Data Stream Classification with Novel Class Detection: a Review, Comparison and Challenges

A survey on imbalanced learning: latest research, applications and future directions

Online Learning From Incomplete and Imbalanced Data Streams

A literature survey on various aspect of class imbalance problem in data mining

Rarity updated ensemble with oversampling: An ensemble approach to classification of imbalanced data streams

Improving Online Bagging for Complex Imbalanced Data Stream

A Hybrid Active-Passive Approach to Imbalanced Nonstationary Data Stream Classification

Online ensemble learning algorithm for imbalanced data stream

Resampling strategies for imbalanced regression: a survey and empirical analysis

Emril: Ensemble Method Based on Reinforcement Learning for Binary Classification in Imbalanced Drifting Data Streams

Standardized Evaluation of Machine Learning Methods for Evolving Data Streams