Abstract: Predicting credit default risk is important to financial institutions because accurately estimating the likelihood that a borrower will default on a loan helps reduce financial losses and thereby maintain profitability and stability. Although machine learning models have been used to assess large volumes of applications with complex attributes, there is still a need to identify the most effective techniques for model development, including techniques for addressing data imbalance. In this research, we conducted a comparative analysis of random forest, decision tree, Support Vector Machines (SVMs), Extreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost) and the multi-layer perceptron for predicting credit defaults using loan data from LendingClub. XGBoost was then used as a framework for testing and evaluating various techniques. In particular, we used this XGBoost framework to address the observed class imbalance by testing several resampling methods: Random Over-Sampling (ROS), the Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), Random Under-Sampling (RUS), and hybrid approaches such as SMOTE with Tomek Links and SMOTE with Edited Nearest Neighbours (SMOTE + ENN). The results showed that models trained on the balanced datasets significantly outperformed those trained on the imbalanced dataset, with SMOTE + ENN delivering the best overall performance: an accuracy of 90.49%, a precision of 94.61% and a recall of 92.02%. Ensemble methods such as voting and stacking were then employed to enhance performance further. Our proposed model achieved an accuracy of 93.7%, a precision of 95.6% and a recall of 95.5%, which shows the potential of ensemble methods to improve credit default prediction and to give lending platforms a tool for reducing default rates and financial losses.
In conclusion, the findings from this study have broader implications for financial institutions, offering a robust approach to risk assessment beyond the LendingClub dataset.
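
To make the SMOTE with Edited Nearest Neighbours hybrid named in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the implementation used in the study, which is not described here. The function names (`smote`, `enn`, `smote_enn`), the neighbour count `k=3`, the majority-vote cleaning rule and the toy two-feature data are all assumptions made for illustration. The sketch assumes binary labels and at least two minority samples.

```python
import math
import random
from collections import Counter


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a random
    minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def enn(X, y, k=3):
    """Edited Nearest Neighbours (majority-vote variant, an assumption):
    drop each sample whose label disagrees with most of its k neighbours."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: math.dist(xi, X[j]))[:k]
        majority = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if majority == yi:
            keep_X.append(xi)
            keep_y.append(yi)
    return keep_X, keep_y


def smote_enn(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity, then clean with ENN."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k,
                       random.Random(seed))
    X_bal = list(X) + new_points
    y_bal = list(y) + [minority_label] * len(new_points)
    return enn(X_bal, y_bal, k)


# Hypothetical toy data: 8 non-default (0) and 3 default (1) samples.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2), (2, 1),
     (5, 5), (5, 6), (6, 5)]
y = [0] * 8 + [1] * 3

X_res, y_res = smote_enn(X, y)
print(Counter(y_res))
```

Resampling of this kind is normally applied to the training split only, so that the test split keeps the original class distribution and the reported accuracy, precision and recall reflect realistic conditions.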

Multiple optimized ensemble learning for high-dimensional imbalanced credit scoring datasets

Empirical Analysis of Ensemble Learning for Imbalanced Credit Scoring Datasets: A Systematic Review

Credit Scoring Models Using Ensemble Learning and Classification Approaches: A Comprehensive Survey

A Novel Multi-Stage Ensemble Model With a Hybrid Genetic Algorithm for Credit Scoring on Imbalanced Data

A novel hybrid credit scoring model based on ensemble feature selection and multilayer ensemble classification

A novel multi-stage ensemble model with enhanced outlier adaptation for credit scoring

Feature Enhanced Ensemble Modeling with Voting Optimization for Credit Risk Assessment

A New Hybrid Credit Scoring Ensemble Model with Feature Enhancement and Soft Voting Weight Optimization

XGBoost Optimized by Adaptive Particle Swarm Optimization for Credit Scoring

Adaptive Subspace Optimization Ensemble Method for High-Dimensional Imbalanced Data Classification

Novel hybrid ensemble credit scoring model with stacking-based noise detection and weight assignment

Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models

Classification of Imbalanced Credit scoring data sets Based on Ensemble Method with the Weighted-Hybrid-Sampling

Optimizing Ensemble Learning to Reduce Misclassification Costs in Credit Risk Scorecards

Multi-class imbalanced enterprise credit evaluation based on asymmetric bagging combined with light gradient boosting machine

A multi-level classification based ensemble and feature extractor for credit risk assessment

Bagging Supervised Autoencoder Classifier for Credit Scoring

Empirical Evaluation of Ensemble Learning for Credit Scoring

Ensemble-Based Machine Learning Algorithm for Loan Default Risk Prediction

A novel dynamic ensemble selection classifier for an imbalanced data set: An application for credit risk assessment

A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data