Yancheng Liang, Jiajie Zhang, Hui Li, Xiaochen Liu, Yi Hu, Yong Wu, Jinyao Zhang, Yongyan Liu, Yi Wu
Abstract: Despite the tremendous advances achieved over the past years by deep learning techniques, the latest risk prediction models for industrial applications still rely on highly hand-tuned, stage-wise statistical learning tools, such as gradient boosting and random forest methods. Different from images or languages, real-world financial data are high-dimensional, sparse, noisy and extremely imbalanced, which makes deep neural network models particularly challenging to train and fragile in practice. In this work, we propose DeRisk, an effective deep learning risk prediction framework for credit risk prediction on real-world financial data. DeRisk is the first deep risk prediction model that outperforms statistical learning approaches deployed in our company's production system. We also perform extensive ablation studies on our method to present the most critical factors for the empirical success of DeRisk.
What problem does this paper attempt to address?
The paper addresses credit risk prediction on real-world financial data. Production risk models still rely mainly on heavily hand-tuned statistical learning tools such as gradient boosting and random forests. Deep learning has not displaced them because real-world financial data are high-dimensional, sparse, noisy, and extremely imbalanced, which makes deep neural networks hard to train and fragile in practice. The paper therefore proposes DeRisk, a deep learning framework designed to train reliably on such data and to outperform the statistical learning methods currently in use.
### Main Contributions
1. **Developed a comprehensive workflow**: The workflow covers every stage of model training for risk prediction, from label selection and data preprocessing to modeling and fine-tuning.
2. **Implemented the DeRisk framework**: According to the authors, this is the first deep risk prediction model to outperform the statistical learning approaches deployed in their company's production system.
3. **Conducted extensive ablation studies**: These studies identify the factors most critical to DeRisk's empirical success and yield practical suggestions for researchers and practitioners.
### Method Overview
1. **Overall Process**:
- **Data Preprocessing**: Carefully process the input features and convert them into a structured format suitable for training the deep network.
- **Train Non-sequential and Sequential Models Separately**: Use two main sub-models, a DNN for non-sequential features and a Transformer-based model for sequential features.
- **Joint Fine-tuning**: Concatenate the output hidden layers of the two sub-models, apply another linear head to produce the final prediction score, and then fine-tune the entire model jointly to improve performance.
2. **Label Selection**:
- Selected long-term labels for training because they are more balanced than short-term labels and less sensitive to shifts in the data distribution over time, which improves the model's generalization ability.
3. **Data Preprocessing**:
- Normalized and processed time features, numerical features, and categorical features.
- Applied feature selection, using XGBoost to select the 500 most important non-sequential numerical features.
- Handled missing values and outliers explicitly, preserving the information carried by meaningful 0 and NaN values.
4. **Modeling Non-sequential Features**:
- Used a simple but effective neural network architecture, including an embedding layer, a multi-layer perceptron (MLP), and a sigmoid activation function.
5. **Modeling Sequential Features**:
- Adopted a Transformer-based model and used time embedding and attention mechanisms to process sequential data.
- Introduced masked language model (MLM) pre-training to accelerate the training of the sequential model.
6. **Weighted BCE Loss**:
- To deal with data imbalance, adopted a weighted binary cross-entropy (BCE) loss function and used oversampling in the sequential model.
7. **Separate Training and Joint Fine-tuning**:
- Adopted a two-stage training strategy: first train the non-sequential and sequential models separately, then fine-tune them jointly so that the sequential features are fully utilized.
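
Steps 6 and 7 above can be illustrated with a minimal, dependency-free Python sketch. It is not the paper's implementation: the function names, vector sizes, head weights, and the `pos_weight` parameter are all hypothetical, and scaling the positive-class term is only one common form of weighted BCE, assumed here for illustration. The sketch concatenates the hidden outputs of the two sub-models, applies a linear head followed by a sigmoid, and scores the predictions with the weighted loss.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def joint_score(h_nonseq, h_seq, weights, bias):
    """Concatenate the two hidden vectors, then apply a linear head + sigmoid.

    h_nonseq / h_seq stand in for the hidden outputs of the non-sequential
    and sequential sub-models (hypothetical names and sizes).
    """
    h = list(h_nonseq) + list(h_seq)
    if len(h) != len(weights):
        raise ValueError("head weights must match the concatenated size")
    logit = sum(w * x for w, x in zip(weights, h)) + bias
    return sigmoid(logit)


def weighted_bce(probs, labels, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with the positive-class term scaled.

    pos_weight > 1 makes errors on the rare positive class cost more.
    This weighting scheme is an assumption, not taken from the paper.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Toy usage with made-up numbers: two users, one positive and one negative.
head_w, head_b = [1.0, 2.0, -0.5], 0.0
probs = [
    joint_score([0.2, -0.1], [0.4], head_w, head_b),
    joint_score([-0.3, 0.1], [0.6], head_w, head_b),
]
loss = weighted_bce(probs, labels=[1, 0], pos_weight=5.0)
print(f"probabilities={probs}, weighted BCE={loss:.4f}")
```

In joint fine-tuning, the gradient of this loss would flow through the head into both sub-models, which is what distinguishes it from training the head alone on frozen features.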
### Experimental Results
- **Baseline Models**: Include XGBoost, DeepFM, DCN, AutoInt, etc.
- **Experimental Setup**: Used credit report data and repayment behavior data of 582,996 users from August 2020 to July 2021.
- **Evaluation Metrics**: Mainly used the AUC (area under the ROC curve) score for evaluation.
- **Main Results**:
- DeRisk's non-sequential model (DNN) and sequential model (MLM + Transformer) both outperformed all baseline models.
- The performance of the jointly fine - tuned model is better than that of the non - sequential or sequential model used alone.
- Complex models do not necessarily perform better.
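
For reference, the AUC metric used above can be computed directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a plain-Python illustration with a hypothetical function name and made-up scores, not the paper's evaluation code, and its quadratic loop is only practical for small samples.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 1 (positive, e.g. default) or 0 (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Because the metric depends only on how positives rank relative to negatives, it does not change with the ratio of positive to negative examples, which makes it a natural choice for extremely imbalanced data.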
Through these methods and experiments, the paper demonstrates that the DeRisk framework is effective on real-world financial data and outperforms the baselines it was compared against.