Incremental Outlier Detection Modelling Using Streaming Analytics in Finance & Health Care

Ch Priyanka, Vivek
2023-05-17
Abstract: In this paper, we build online models incrementally using online outlier detection algorithms in a streaming environment. We identify a strong need for streaming models to handle streaming data. The objective of this project is to study and analyze the importance of streaming models that are applicable in real-world environments. In this work, we build various Outlier Detection (OD) algorithms, viz., One-Class Support Vector Machine (OC-SVM), Isolation Forest Adaptive Sliding Window approach (IForest ASD), Exact Storm, Angle-Based Outlier Detection (ABOD), Local Outlier Factor (LOF), KitNet, and KNN ASD. The effectiveness and validity of these models are evaluated on various finance problems such as credit card fraud detection, churn prediction, and Ethereum fraud prediction. Further, we analyze the performance of the models on health care prediction problems such as heart stroke prediction and diabetes prediction. The results show that the approach performs well on highly imbalanced datasets, that is, datasets in which the negative class is the majority and the positive class is the minority. Among all the models, the ensemble strategy IForest ASD performed better in most cases, ranking among the top three models in almost all of them.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to use streaming data analysis for incremental outlier detection (OD) in finance and healthcare. Specifically, the authors study the importance of models that can operate on real-time data streams, and they develop an online incremental outlier detection framework for problems in finance (credit card fraud detection, customer churn prediction, Ethereum fraud prediction) and healthcare (stroke prediction, diabetes prediction).

### Main problems:

1. **Challenges in processing stream data**: Unlike batch data, stream data is generated continuously and must be processed in real time. Traditional offline models therefore cannot handle such dynamically changing data effectively.
2. **Processing highly imbalanced data sets**: In many practical scenarios the data is highly imbalanced, with far more negative-class samples than positive-class samples. In fraud detection, for example, normal transactions far outnumber fraudulent ones. Keeping the model effective under this imbalance is a key issue.
3. **Selecting appropriate outlier detection algorithms**: The authors compare multiple outlier detection algorithms (such as One-Class SVM, Isolation Forest ASD, and Exact Storm) and evaluate their performance in a streaming environment.

### Solutions:

- **Incremental modeling**: Using a sliding window, the model is updated continuously as new data arrives, so it adapts to changes in the data.
- **Online learning**: The model is trained while it receives new data, so abnormal behavior can be captured promptly.
- **Performance evaluation**: Comparing online and offline models verifies the advantage of the incremental model in processing stream data.
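
The sliding-window idea described under Solutions can be illustrated with a minimal sketch. This is not the authors' implementation and does not reproduce any of the algorithms they evaluated. It is a simplified distance-based detector, loosely in the spirit of sliding-window methods such as Exact Storm, written with the Python standard library only. The class name `SlidingWindowDistanceDetector`, the function `detect`, the parameters `window_size`, `radius`, and `min_neighbors`, and the toy stream are all assumptions made for illustration.

```python
from collections import deque
from math import dist


class SlidingWindowDistanceDetector:
    """Hypothetical detector: a point is an outlier when it has fewer than
    `min_neighbors` points within `radius` among the last `window_size` points."""

    def __init__(self, window_size=50, radius=1.0, min_neighbors=3):
        # deque(maxlen=...) drops the oldest point automatically on append,
        # which is what makes the model "incremental".
        self.window = deque(maxlen=window_size)
        self.radius = radius
        self.min_neighbors = min_neighbors

    def score_then_update(self, point):
        """Score the incoming point against the current window, then store it."""
        neighbors = sum(1 for p in self.window if dist(p, point) <= self.radius)
        # While the window holds too few points, nothing is judged an outlier.
        warmed_up = len(self.window) >= self.min_neighbors
        is_outlier = warmed_up and neighbors < self.min_neighbors
        self.window.append(point)
        return is_outlier


def detect(stream, **params):
    """Run the detector over an iterable of points, one point at a time."""
    detector = SlidingWindowDistanceDetector(**params)
    return [detector.score_then_update(tuple(x)) for x in stream]


# Toy one-dimensional stream (made up): values near 0, one isolated point,
# then a shift to values near 10.
stream = (
    [(0.1 * (i % 5),) for i in range(20)]
    + [(50.0,)]
    + [(10.0 + 0.1 * (i % 5),) for i in range(20)]
)
flags = detect(stream, window_size=10, radius=1.0, min_neighbors=3)
print([i for i, flagged in enumerate(flags) if flagged])
```

On this toy stream the isolated point is flagged, and so are the first few points after the shift, until the window contains enough points from the new regime. That brief burst of flags followed by recovery is the adaptation that a fixed offline model would not show. Each point is scored before it is added to the window, so a point never counts as its own neighbor.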
### Experimental results:

The experiments show that on highly imbalanced data sets (such as the stroke prediction and Ethereum fraud detection data sets), the incremental model (Scenario 2) performs better than the offline model (Scenario 1). In particular, the Isolation Forest ASD model performs best on such data sets and is usually among the top three of all models.

### Summary:

The main contribution of this paper is an incremental outlier detection framework for stream data, together with experiments that demonstrate its effectiveness on highly imbalanced data sets. Future work could improve the robustness of the model and raise detection accuracy by integrating multiple models.