Abstract: Real-world tabular learning production scenarios typically involve evolving data streams, where data arrives continuously and its distribution may change over time. In such a setting, most studies in the literature on supervised learning favor instance incremental algorithms because of their ability to adapt to changes in the data distribution. Another significant reason for choosing these algorithms is to avoid storing observations in memory, as is commonly done in batch incremental settings. However, the design of instance incremental algorithms often assumes that labels are immediately available, which is an optimistic assumption. In many real-world scenarios, such as fraud detection or credit scoring, labels may be delayed. Consequently, batch incremental algorithms are widely used in many real-world tasks. This raises an important question: "In delayed settings, is instance incremental learning the best option regarding predictive performance and computational efficiency?" Unfortunately, this question has not been studied in depth, probably because of the scarcity of real datasets containing delayed information. In this study, we conduct a comprehensive empirical evaluation and analysis of this question using a real-world fraud detection problem and commonly used generated datasets. Our findings indicate that instance incremental learning is not the superior option, considering on one side state-of-the-art models such as Adaptive Random Forest (ARF) and on the other side batch learning models such as XGBoost. Additionally, when considering the interpretability of the learning systems, batch incremental solutions tend to be favored. Code: https://github.com/anselmeamekoe/DelayedLabelStream

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores the performance differences between Instance Incremental Learning and Batch Incremental Learning on tabular data streams in a delayed label environment. Specifically, the paper attempts to answer the following questions:

1. **Predictive Performance**: Does Instance Incremental Learning have better predictive performance than Batch Incremental Learning in a delayed label environment?
2. **Computational Efficiency**: Is Instance Incremental Learning more efficient than Batch Incremental Learning in a delayed label environment?
3. **Interpretability**: Which learning method performs better in terms of model interpretability in a delayed label environment?

### Background and Motivation

In real-world tabular learning scenarios, data typically arrives continuously in a stream, and its distribution may change over time. In such cases, most research tends to use instance incremental algorithms because they can adapt to changes in data distribution and do not require storing all observed data in memory. However, the design of instance incremental algorithms usually assumes that labels are immediately available, which is not the case in many real-world scenarios. For example, in fraud detection or credit scoring tasks, labels may arrive with a delay. Therefore, batch incremental algorithms are widely used in many practical tasks.

### Research Methodology

To answer the above questions, the authors conducted a comprehensive empirical study using a real-world fraud detection dataset and commonly used generated datasets. The specific steps include:

1. **Experimental Framework Design**: Designed a supervised evaluation framework based on interleaved blocks, incorporating label delay.
2. **Algorithm Comparison**: Conducted an empirical comparison of instance incremental and batch incremental algorithms within the designed evaluation framework.
3. **Performance Analysis**: Analyzed the performance of batch incremental models and demonstrated the importance of storing past observed data where possible, especially in tasks with rare target events, such as fraud detection.
4. **Interpretability Analysis**: Explored the advantages of batch incremental models in terms of interpretability.

### Main Findings

1. **Predictive Performance**: The study shows that instance incremental learning is not the best choice in a delayed label environment. Comparing state-of-the-art instance incremental models (such as Adaptive Random Forest, ARF) with batch learning models (such as XGBoost), batch incremental learning outperforms instance incremental learning in predictive performance.
2. **Computational Efficiency**: Batch incremental learning also excels in computational efficiency, especially when handling large-scale datasets.
3. **Interpretability**: Batch incremental models perform better in terms of interpretability because their architecture and weights are updated less frequently, making them easier to understand and track.

### Conclusion

In summary, this paper's empirical study finds that in a delayed label environment, batch incremental learning outperforms instance incremental learning in terms of predictive performance, computational efficiency, and interpretability. This finding is significant for the selection of machine learning models in practical applications, especially in tasks that need to handle delayed label data.
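### Illustrative Sketch

To make the delayed label setting concrete, the following is a minimal sketch of a test-then-train loop in which each label only becomes visible a fixed number of steps after its instance is seen. It is not the authors' framework: the paper's evaluation is based on interleaved blocks, whereas this simplified version works per instance. The names `delayed_prequential`, `InstanceMajority`, and `BatchMajority` are hypothetical, the delay is assumed to be measured in number of instances, and the two majority-class learners are toy stand-ins for models such as ARF and XGBoost.

```python
from collections import Counter, deque


def delayed_prequential(stream, delay, model):
    """Test-then-train where the label of instance t arrives `delay` steps later.

    Assumption: delay is counted in instances, not wall-clock time.
    Returns the accuracy over the whole stream.
    """
    pending = deque()  # (arrival_time, x, y) entries still waiting for their label
    correct = total = 0
    for t, (x, y) in enumerate(stream):
        # 1) Release every label whose delay has elapsed and let the model learn.
        while pending and pending[0][0] <= t:
            _, x_old, y_old = pending.popleft()
            model.learn(x_old, y_old)
        # 2) Predict the current instance before its label is known.
        correct += int(model.predict(x) == y)
        total += 1
        # 3) Queue the true label; it becomes visible at time t + delay.
        pending.append((t + delay, x, y))
    return correct / total if total else 0.0


class InstanceMajority:
    """Toy instance incremental learner: updates on every arriving label."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, x, y):
        self.counts[y] += 1

    def predict(self, x):
        # Predict class 0 until at least one label has been seen.
        return self.counts.most_common(1)[0][0] if self.counts else 0


class BatchMajority:
    """Toy batch incremental learner: stores labels, refits every `chunk` labels."""

    def __init__(self, chunk=100):
        self.chunk = chunk
        self.stored = []   # past labelled observations kept in memory
        self.majority = 0  # class predicted until the first refit

    def learn(self, x, y):
        self.stored.append(y)
        if len(self.stored) % self.chunk == 0:
            # Refit from scratch on everything stored so far.
            self.majority = Counter(self.stored).most_common(1)[0][0]

    def predict(self, x):
        return self.majority


if __name__ == "__main__":
    toy_stream = [(i, 1) for i in range(10)]  # hypothetical stream, all labels are 1
    print(delayed_prequential(toy_stream, delay=3, model=InstanceMajority()))
    print(delayed_prequential(toy_stream, delay=3, model=BatchMajority(chunk=5)))
```

The same loop serves both learner families, so any difference in the score comes only from how each model uses the labels once they arrive. Swapping the toy learners for real models only requires objects that expose `learn` and `predict`.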

Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection
