Bond Default Prediction with Text Embeddings, Undersampling and Deep Learning

Luke Jordan
DOI: https://doi.org/10.48550/arXiv.2110.07035
2021-10-14
Abstract: The special and important problems of default prediction for municipal bonds are addressed using a combination of text embeddings from a pre-trained transformer network, a fully connected neural network, and synthetic oversampling. The combination of these techniques provides significant improvement in performance over human estimates, linear models, and boosted ensemble models, on data with extreme imbalance. Less than 0.2% of municipal bonds default, but our technique predicts 9 out of 10 defaults at the time of issue, without using bond ratings, at a cost of false positives on less than 0.1% non-defaulting bonds. The results hold the promise of reducing the cost of capital for local public goods, which are vital for society, and bring techniques previously used in personal credit and public equities (or national fixed income), as well as the current generation of embedding techniques, to sub-sovereign credit decisions.
Machine Learning
What problem does this paper attempt to address?
The paper addresses default prediction for U.S. municipal bonds. The author combines text embeddings, undersampling, and deep learning to improve prediction accuracy on extremely imbalanced data, where less than 0.2% of bonds default. The study has the following specific objectives:

1. **Improve prediction performance**: Combine text embeddings from a pre-trained Transformer network, a fully connected neural network, and synthetic oversampling to outperform human estimates, linear models, and boosted ensemble models.
2. **Do not rely on credit ratings**: Use neither bond ratings nor time-series data for prediction, relying only on public, easily accessible data such as bond purpose, maturity, geographical location, and national macroeconomic data.
3. **Reduce capital costs**: Use more accurate default predictions to lower the cost of capital for local governments financing public projects, so that they can provide more public goods and services.
4. **Transparency and interpretability**: Help ordinary people understand the risk and potential pricing of municipally issued debt, increasing market transparency.
5. **Expand application areas**: Extend these techniques to other credit decisions, particularly bonds and other local public credit, as a basis for future research and applications.

### Formulas and methods

To deal with the extreme class imbalance, the author uses the following techniques:

- **Text embedding**: Use a pre-trained Siamese BERT network to embed project descriptions as vectors of dimension \(d = 384\).
- **Synthetic oversampling**: Use SMOTE-ENC (Synthetic Minority Over-sampling Technique for Encoded Nominal and Continuous features) to generate synthetic minority samples from data with both categorical and continuous features.
- **Weighted random sampling**: Draw each training batch by weighted random sampling to balance the class distribution seen during training.
- **Multilayer perceptron (MLP)**: Build a neural network with four hidden layers of sizes \([128, 256, 64, 8]\), a dropout rate of 0.1, and a batch size of 256.

### Results

With these methods, the author reports the following results on the test set:

- **PR AUC** (area under the precision-recall curve): 0.967
- **KS statistic** (Kolmogorov-Smirnov two-sample statistic): 227
- **False positive rate**: 0.06%
- **False negative rate**: 0.9%

These results indicate that the proposed method performs well at predicting municipal bond defaults on an extremely imbalanced data set.

### Summary

The study shows how machine learning and natural language processing techniques can improve municipal bond default prediction, with practical benefits for local governments and society. It also serves as a reference for future applications in other areas of credit.
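
The weighted random sampling step listed under "Formulas and methods" can be illustrated with a minimal sketch. This is not the author's code: the summary does not say which framework or weighting scheme the paper uses, so the inverse-class-frequency weights, sampling with replacement, the toy labels, and the helper names `class_balanced_weights` and `sample_batch` are all assumptions made for illustration. Only the batch size of 256 comes from the text above.

```python
import random
from collections import Counter


def class_balanced_weights(labels):
    """Weight each sample by the inverse frequency of its class (assumed scheme)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def sample_batch(labels, weights, batch_size, rng):
    """Draw one batch of sample indices, with replacement, proportional to weights."""
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)


# Hypothetical toy data: 10,000 bonds, of which 20 (0.2%) default.
labels = [1] * 20 + [0] * 9980
weights = class_balanced_weights(labels)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
batch = sample_batch(labels, weights, batch_size=256, rng=rng)

# Each class carries the same total weight, so roughly half of every batch
# is expected to consist of defaulting bonds despite the 0.2% base rate.
default_share = sum(labels[i] for i in batch) / len(batch)
print(f"share of defaults in batch: {default_share:.2f}")
```

Because the 20 defaulting bonds are drawn repeatedly, a sampler like this balances batches but adds no new minority examples; in the summary above, that is the role of synthetic oversampling.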