Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection

Kodjo Mawuena Amekoe, Mustapha Lebbah, Gregoire Jaffre, Hanene Azzag, Zaineb Chelly Dagdia
2024-09-16
Abstract: Real-world tabular learning production scenarios typically involve evolving data streams, where data arrives continuously and its distribution may change over time. In such a setting, most studies in the literature regarding supervised learning favor the use of instance incremental algorithms due to their ability to adapt to changes in the data distribution. Another significant reason for choosing these algorithms is that they avoid storing observations in memory, as is commonly done in batch incremental settings. However, the design of instance incremental algorithms often assumes immediate availability of labels, which is an optimistic assumption. In many real-world scenarios, such as fraud detection or credit scoring, labels may be delayed. Consequently, batch incremental algorithms are widely used in many real-world tasks. This raises an important question: "In delayed settings, is instance incremental learning the best option regarding predictive performance and computational efficiency?" Unfortunately, this question has not been studied in depth, probably due to the scarcity of real datasets containing delayed information. In this study, we conduct a comprehensive empirical evaluation and analysis of this question using a real-world fraud detection problem and commonly used generated datasets. Our findings indicate that instance incremental learning is not the superior option when state-of-the-art instance incremental models such as Adaptive Random Forest (ARF) are compared with batch learning models such as XGBoost. Additionally, when considering the interpretability of the learning systems, batch incremental solutions tend to be favored. Code: https://github.com/anselmeamekoe/DelayedLabelStream
Machine Learning; Computational Engineering, Finance, and Science; Neural and Evolutionary Computing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores the performance differences between Instance Incremental Learning and Batch Incremental Learning on tabular data streams in a delayed label environment. Specifically, the paper attempts to answer the following questions:

1. **Predictive Performance**: Does Instance Incremental Learning have better predictive performance than Batch Incremental Learning in a delayed label environment?
2. **Computational Efficiency**: Is Instance Incremental Learning more efficient than Batch Incremental Learning in a delayed label environment?
3. **Interpretability**: Which learning method is preferable in terms of model interpretability in a delayed label environment?

### Background and Motivation

In real-world tabular learning scenarios, data typically arrives continuously in a stream, and its distribution may change over time. In such cases, most research tends to use instance incremental algorithms because they can adapt to changes in data distribution and do not require storing observed data in memory. However, the design of instance incremental algorithms usually assumes that labels are immediately available, which is not the case in many real-world scenarios. For example, in fraud detection or credit scoring tasks, labels may arrive with a delay. Therefore, batch incremental algorithms are widely used in many practical tasks.

### Research Methodology

To answer the above questions, the authors conducted a comprehensive empirical study using a real-world fraud detection dataset and commonly used generated datasets. The specific steps include:

1. **Experimental Framework Design**: Designed a supervised evaluation framework based on interleaved blocks, incorporating label delay.
2. **Algorithm Comparison**: Conducted an empirical comparison of instance incremental and batch incremental algorithms within the designed evaluation framework.
3. **Performance Analysis**: Analyzed the performance of batch incremental models and demonstrated the importance of storing past observed data where possible, especially in tasks with rare target events, such as fraud detection.
4. **Interpretability Analysis**: Explored the advantages of batch incremental models in terms of interpretability.

### Main Findings

1. **Predictive Performance**: The study shows that instance incremental learning is not the best choice in a delayed label environment. When state-of-the-art instance incremental models (such as Adaptive Random Forest, ARF) are compared with batch learning models (such as XGBoost), batch incremental learning outperforms instance incremental learning in predictive performance.
2. **Computational Efficiency**: Batch incremental learning also excels in computational efficiency, especially when handling large-scale datasets.
3. **Interpretability**: Batch incremental models are favored in terms of interpretability because their architecture and weights are updated less frequently, making them easier to understand and track.

### Conclusion

In summary, this paper's empirical study finds that in a delayed label environment, batch incremental learning outperforms instance incremental learning in terms of predictive performance, computational efficiency, and interpretability. This finding is significant for the selection of machine learning models in practical applications, especially in tasks that must handle delayed labels.
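
To make the evaluation idea from the methodology more concrete, here is a minimal sketch of a test-then-train loop over blocks in which labels arrive with a delay. It is not the authors' framework: the function names, the `delay_blocks` parameter, the synthetic stream, and the toy nearest-centroid classifier (a stand-in for a batch model such as XGBoost) are all assumptions made for illustration. The model is first evaluated on each new block, then retrained on every block whose labels have arrived so far.

```python
import random


class CentroidClassifier:
    """Toy batch model (hypothetical stand-in): nearest class centroid on 1-D features."""

    def __init__(self):
        self.centroids = {}

    def fit(self, xs, ys):
        # Compute the mean feature value for each class seen so far.
        self.centroids = {}
        for c in set(ys):
            values = [x for x, y in zip(xs, ys) if y == c]
            self.centroids[c] = sum(values) / len(values)
        return self

    def predict(self, x):
        if not self.centroids:
            return 0  # default prediction before any labels have arrived
        return min(self.centroids, key=lambda c: abs(x - self.centroids[c]))


def make_stream(n, seed=0):
    """Synthetic stream (assumption): class 0 centred at 0.0, class 1 at 2.0."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n):
        y = rng.randint(0, 1)
        stream.append((rng.gauss(2.0 * y, 0.5), y))
    return stream


def delayed_block_evaluation(stream, block_size, delay_blocks):
    """Test-then-train over blocks; labels of block t arrive at block t + delay_blocks."""
    blocks = [stream[i:i + block_size] for i in range(0, len(stream), block_size)]
    model = CentroidClassifier()
    stored_x, stored_y = [], []
    accuracies = []
    for t, block in enumerate(blocks):
        # 1) Test: score the current model on the new, still unlabeled block.
        correct = sum(model.predict(x) == y for x, y in block)
        accuracies.append(correct / len(block))
        # 2) Train: the labels of an older block become available now.
        arrived = t - delay_blocks
        if arrived >= 0:
            for x, y in blocks[arrived]:
                stored_x.append(x)
                stored_y.append(y)
            # Batch incremental update: refit on all stored labeled data.
            model = CentroidClassifier().fit(stored_x, stored_y)
    return accuracies


if __name__ == "__main__":
    acc = delayed_block_evaluation(make_stream(1000), block_size=100, delay_blocks=2)
    print([round(a, 2) for a in acc])
```

In this sketch the first `delay_blocks + 1` blocks are scored with an untrained model, which shows how label delay postpones the point at which any learner can start adapting. Storing all past labeled observations, as done here, mirrors the summary's point that keeping past data can matter when target events are rare; a sliding window would be an alternative design.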