Feature Selection and Sensitivity Analysis of Oversampling in Big and Highly Imbalanced Bank's Credit Data

Nimas Sefrida Andriaswuri, Harry Patria, M. A. Nafis, Aznovri Kurniawan, Ahmad Rifa'i, D. Purwitasari
DOI: https://doi.org/10.1109/ICoICT55009.2022.9914889
2022-08-02
Abstract: Machine learning has evolved as a multidisciplinary field of study in the last few years and has gained popularity in big data analytics, including in the banking industry. Numerous methods can be used in predictive analytics through supervised machine learning, for either regression or classification problems. In the banking industry, credit quality is a core focus, since it is one of the main areas reviewed regularly by regulators and it affects banks' profitability. This research is intended to give recommendations on how to select an appropriate machine learning technique and how to perform feature selection and sensitivity analysis on a bank's credit data that contains more than one million records and is highly imbalanced, i.e., 97.5% of the data falls in one category. Several supervised machine learning classification methods, including the application of SMOTE (synthetic minority oversampling technique), are applied, and their computational results are compared and summarized. This results in a recommendation on the most appropriate technique for big and extremely imbalanced datasets, i.e., the Tree Ensemble method with SMOTE, with the computational issue solved through data sampling, without significantly reducing accuracy. It is also concluded that an optimum number of features increases model accuracy; however, a significant reduction in the number of features does not necessarily increase model accuracy. The research is expected to be useful for the banking industry, especially in credit portfolio analytics, and for other industries with big and imbalanced datasets, in performing predictive analytics to support business objectives. Further research could cover more in-depth analytics for the decision-making process in banking.
Economics, Business, Computer Science
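The SMOTE step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote`, its parameters, and the toy two-feature data are hypothetical, and the 195-to-5 class split is chosen only to mirror the 97.5% imbalance described above. The sketch covers the oversampling step alone, not the Tree Ensemble classifier or the feature selection.

```python
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (basic SMOTE idea)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # needs at least two minority points
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy data (assumption): 195 majority and 5 minority records, i.e. 97.5% vs 2.5%.
rng = random.Random(42)
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(195)]
minority = [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(5)]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

In practice, oversampling of this kind is normally applied to the training split only, so that evaluation data keeps the original class distribution.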