A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data

Junlan Chen, Ziyuan Pu, Nan Zheng, Xiao Wen, Hongliang Ding, Xiucheng Guo
2024-04-03
Abstract: Crash data is often highly imbalanced: the majority of crashes are non-fatal, and only a small number are fatal. This imbalance poses a challenge for crash severity modeling, since models struggle to fit and interpret fatal crash outcomes from very limited samples. Such imbalance is usually addressed with data resampling methods, such as under-sampling and over-sampling techniques. However, most traditional and deep learning-based resampling methods, such as the synthetic minority oversampling technique (SMOTE) and generative adversarial networks (GANs), are designed specifically for continuous variables. Though some resampling methods have been improved to handle both continuous and discrete variables, they may have difficulty dealing with the collapse issue associated with sparse discrete risk factors. Moreover, there is a lack of comprehensive studies that compare the performance of various resampling methods in crash severity modeling. To address these issues, the current study proposes a crash data generation method based on the Conditional Tabular GAN (CTGAN). After data balancing, a crash severity model is employed to estimate the performance of classification and interpretation. A comparative study is conducted to assess the classification accuracy and distribution consistency of the proposed generation method using a 4-year imbalanced crash dataset collected in Washington State, U.S. Additionally, Monte Carlo simulation is employed to estimate the performance of parameter and probability estimation in both two- and three-class imbalance scenarios. The results indicate that using synthetic data generated by CTGAN-RU for crash severity modeling outperforms using original data or synthetic data generated by other resampling methods.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the problem of data imbalance in modeling the severity of traffic accidents. Data imbalance here means that most accidents are non-fatal, while fatal accidents are a small minority. This poses a challenge for model training, because the limited number of fatal accident samples makes it difficult for the model to predict fatal outcomes accurately. Traditional methods such as undersampling and oversampling have their limitations, which may result in information loss or overfitting.

The paper proposes a traffic accident data generation method based on the Conditional Tabular GAN (CTGAN). CTGAN can handle both continuous and discrete variables and is particularly suitable for sparse discrete risk factors. After the data is balanced using CTGAN, an accident severity model is built to evaluate classification and interpretability performance. The study compares different resampling methods (including oversampling, undersampling, and mixed sampling) on a four-year imbalanced accident dataset from Washington State, and uses Monte Carlo simulation to assess parameter and probability estimation in two-class and three-class imbalance scenarios. The results show that modeling accident severity with data generated by CTGAN-RU outperforms the other methods.

The main contributions of the paper are as follows:
1. The development of a CTGAN-based data generation method that can handle both continuous and discrete risk factors, especially addressing the sparsity problem of discrete variables.
2. Empirical research and Monte Carlo simulation to evaluate the distribution consistency and parameter recovery ability of synthetic samples, covering different imbalance scenarios.
3. Comparison of different resampling methods (such as SMOTE-NC, TVAE, and random undersampling, RU) with the proposed generation method, demonstrating the superiority of CTGAN-RU.
The organization of the paper includes a literature review, method description (including CTGAN, baseline resampling methods, accident severity model, and evaluation metrics), data preparation, model performance analysis, and model performance evaluation, followed by conclusions and suggestions for future research.
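To make the mixed-sampling idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that CTGAN-RU means randomly undersampling the majority (non-fatal) class and topping up the minority (fatal) class with generated records; the paper's exact procedure may differ. The function names, the `severity` field, and the toy records are all hypothetical, and `bootstrap_generator` is only a stand-in for sampling from a trained CTGAN: it copies existing minority rows rather than generating new ones.

```python
import random
from collections import Counter


def random_undersample(rows, label_key, majority_label, target_size, seed=0):
    """Keep every non-majority row plus a random subset of majority rows."""
    rng = random.Random(seed)
    majority = [r for r in rows if r[label_key] == majority_label]
    others = [r for r in rows if r[label_key] != majority_label]
    kept = rng.sample(majority, min(target_size, len(majority)))
    return others + kept


def bootstrap_generator(minority_rows, seed=0):
    """Placeholder generator (assumption): resamples real minority rows.

    In a CTGAN-based pipeline this would be replaced by sampling from a
    generative model trained on the crash records.
    """
    rng = random.Random(seed)

    def generate(n):
        return [dict(rng.choice(minority_rows)) for _ in range(n)]

    return generate


def hybrid_balance(rows, label_key, minority_label, majority_label,
                   generate_minority, target_size, seed=0):
    """Undersample the majority class, then add synthetic minority rows."""
    reduced = random_undersample(rows, label_key, majority_label,
                                 target_size, seed)
    n_minority = sum(1 for r in reduced if r[label_key] == minority_label)
    synthetic = generate_minority(max(0, target_size - n_minority))
    return reduced + synthetic


# Made-up toy records: 95 non-fatal crashes and 5 fatal crashes.
rows = (
    [{"severity": "non_fatal", "speed_limit": 35 + (i % 4) * 10,
      "wet_road": i % 3 == 0} for i in range(95)]
    + [{"severity": "fatal", "speed_limit": 55 + (i % 2) * 10,
        "wet_road": i % 2 == 0} for i in range(5)]
)

fatal_rows = [r for r in rows if r["severity"] == "fatal"]
balanced = hybrid_balance(
    rows,
    label_key="severity",
    minority_label="fatal",
    majority_label="non_fatal",
    generate_minority=bootstrap_generator(fatal_rows, seed=1),
    target_size=20,
)
counts = Counter(r["severity"] for r in balanced)
print(counts)
```

With these toy inputs, each class ends up with 20 records: 20 sampled non-fatal rows, and the 5 real fatal rows plus 15 generated ones. Passing the generator as a callable keeps the balancing logic independent of the generative model, so a different generator can be substituted without changing `hybrid_balance`.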