Unbalanced Credit Card Fraud Detection Data: A Machine Learning-Oriented Comparative Study of Balancing Techniques

Palak Gupta, Anmol Varshney, Mohammad Rafeek Khan, Rafeeq Ahmed, Mohammed Shuaib, Shadab Alam
DOI: https://doi.org/10.1016/j.procs.2023.01.231
2023-02-01
Procedia Computer Science
Abstract: The number of individuals who use credit cards has increased dramatically in recent decades, as has the volume of fraudulent credit card transactions. Consequently, banks and credit card companies must be able to identify fraudulent transactions so that clients do not have to pay for products they did not purchase. Data science can tackle such challenges, and the value of machine learning methods cannot be overstated. This study demonstrates how to model credit card fraud detection using multiple classifiers together with data balancing techniques. The dataset is imbalanced, which can lead to suboptimal model performance. Experiments on the imbalanced data show that XGBoost performs well, with a precision of 0.91 and an accuracy of 0.99. Different sampling techniques were then applied to improve precision, recall, F1-score, and accuracy. Random Oversampling proved the best-suited technique for the imbalanced data, yielding a precision of 0.99 and an accuracy of 0.99 when applied to the best model, XGBoost. The results of all the classifiers employed are then compared, leading to varied conclusions and directions for further research. Several data balancing procedures, including oversampling, undersampling, and SMOTE, are used in the study; XGBoost outperforms the remaining algorithms, with 99% accuracy and precision under Random Oversampling. The research suggests applying data sampling techniques to the algorithms that perform best on the imbalanced data, in order to obtain the best possible model performance for classifying fraudulent activity.
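As an illustration of the Random Oversampling idea named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `random_oversample`, the toy data, and the seed are assumptions made for this example. It duplicates randomly chosen minority-class rows (sampling with replacement) until every class has as many rows as the majority class.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Hypothetical helper: balance classes by duplicating minority rows.

    Rows of each under-represented class are sampled with replacement
    until that class matches the majority-class count.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Draw the missing number of rows; zero draws for the majority class.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (assumed): 8 legitimate (0) and 2 fraudulent (1) transactions.
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

bal_rows, bal_labels = random_oversample(rows, labels)
balanced_counts = Counter(bal_labels)
```

In practice, resampling of this kind would normally be applied to the training split only, so that duplicated fraud rows do not leak into the evaluation data; a classifier such as XGBoost would then be fitted on the balanced rows.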