Abstract: Crash data are often highly imbalanced: most crashes are non-fatal, and only a small number are fatal. This imbalance poses a challenge for crash severity modeling, because a model struggles to fit and interpret fatal crash outcomes from very limited samples. Such imbalance is usually addressed by data resampling methods, such as under-sampling and over-sampling techniques. However, most traditional and deep learning-based resampling methods, such as the synthetic minority oversampling technique (SMOTE) and generative adversarial networks (GANs), are designed for continuous variables. Although some resampling methods have been improved to handle both continuous and discrete variables, they may have difficulty with the collapse issue associated with sparse discrete risk factors. Moreover, comprehensive studies comparing the performance of various resampling methods in crash severity modeling are lacking. To address these issues, the current study proposes a crash data generation method based on the Conditional Tabular GAN (CTGAN). After data balancing, a crash severity model is employed to evaluate classification and interpretation performance. A comparative study assesses the classification accuracy and distribution consistency of the proposed generation method using a 4-year imbalanced crash dataset collected in Washington State, U.S. Additionally, Monte Carlo simulation is employed to evaluate parameter and probability estimation in both two- and three-class imbalance scenarios. The results indicate that crash severity modeling with synthetic data generated by CTGAN-RU outperforms modeling with the original data or with synthetic data generated by other resampling methods.
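
The under-sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It shows only random under-sampling of the majority class on toy records; how the paper combines this step with CTGAN-generated samples is not described here, so the function name `random_undersample`, the record layout, and the 95/5 class split are all hypothetical.

```python
import random
from collections import Counter


def random_undersample(rows, label_key, majority_label, target_count, seed=0):
    """Keep every non-majority row plus a random subset of majority rows.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    rng = random.Random(seed)
    majority = [r for r in rows if r[label_key] == majority_label]
    others = [r for r in rows if r[label_key] != majority_label]
    if target_count > len(majority):
        raise ValueError("target_count exceeds the number of majority rows")
    kept = rng.sample(majority, target_count)  # sampling without replacement
    balanced = others + kept
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced


# Toy data (assumed): 95 non-fatal crashes and 5 fatal crashes.
crashes = [{"id": i, "severity": "non-fatal"} for i in range(95)]
crashes += [{"id": 95 + i, "severity": "fatal"} for i in range(5)]

balanced = random_undersample(crashes, "severity", "non-fatal", target_count=5)
counts = Counter(r["severity"] for r in balanced)
print(counts)
```

Under-sampling alone discards 90 of the 95 majority records in this toy case, which is the information loss that motivates pairing it with a generative over-sampling step.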

What problem does this paper attempt to address?

This paper addresses data imbalance in crash severity modeling. Data imbalance here means that most crashes are non-fatal, while fatal crashes are a small minority. This poses a challenge for model training, because the limited number of fatal crash samples makes it difficult for the model to predict them accurately. Traditional remedies such as undersampling and oversampling have limitations: undersampling may lose information, and oversampling may lead to overfitting.
The paper proposes a crash data generation method based on the Conditional Tabular GAN (CTGAN). CTGAN handles both continuous and discrete variables and is particularly suited to sparse discrete risk factors. After the data are balanced, a crash severity model is built to evaluate classification and interpretability performance. The study compares different resampling methods (including oversampling, undersampling, and mixed sampling) on a four-year imbalanced crash dataset from Washington State, and uses Monte Carlo simulation to assess parameter and probability estimation in two-class and three-class imbalance scenarios. The results show that crash severity modeling with data generated by CTGAN-RU outperforms the other methods.
The main contributions of the paper are as follows:
1. The development of a CTGAN-based data generation method that can handle both continuous and discrete risk factors, in particular the sparsity problem of discrete variables.
2. Empirical research and Monte Carlo simulation to evaluate the distribution consistency and parameter recovery ability of synthetic samples across different imbalance scenarios.
3. A comparison of different resampling methods (such as SMOTE-NC, TVAE, and random undersampling, RU) with the proposed generation method, demonstrating the superiority of CTGAN-RU.
The paper is organized as follows: a literature review, a method description (covering CTGAN, baseline resampling methods, the crash severity model, and evaluation metrics), data preparation, model performance analysis, and model performance evaluation, followed by conclusions and suggestions for future research.
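
One way to picture the distribution consistency check for a discrete risk factor is to compare category shares between real and synthetic samples. The sketch below uses total variation distance, which is an assumption chosen for illustration: the metric the paper actually uses is not stated in this summary. The variable values ("dry", "wet", "ice") and all counts are hypothetical.

```python
from collections import Counter


def category_shares(values):
    """Return the share of each category in a non-empty list of labels."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute share differences; 0 means identical shares."""
    p = category_shares(real_values)
    q = category_shares(synthetic_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical road-surface factor with one sparse category ("ice").
real = ["dry"] * 70 + ["wet"] * 25 + ["ice"] * 5
good_synth = ["dry"] * 68 + ["wet"] * 26 + ["ice"] * 6
collapsed_synth = ["dry"] * 100  # generator collapsed onto the dominant category

tvd_good = total_variation_distance(real, good_synth)
tvd_collapsed = total_variation_distance(real, collapsed_synth)
print(round(tvd_good, 3), round(tvd_collapsed, 3))
```

In this toy case the well-matched sample has a distance of about 0.02, while the collapsed sample, which never produces the sparse categories, has a distance of about 0.30.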

A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data

Crash injury severity prediction considering data imbalance: A Wasserstein generative adversarial network with gradient penalty approach

PCA-Based Missing Information Imputation for Real-Time Crash Likelihood Prediction under Imbalanced Data

Efficient Generative Adversarial Networks for Imbalanced Traffic Collision Datasets

A crash occurrence risk prediction model based on variational autoencoder and generative adversarial network

Classification of autonomous vehicle crash severity: Solving the problems of imbalanced datasets and small sample size

Crash Severity Risk Modeling Strategies under Data Imbalance

CGAN-EB: A Non-parametric Empirical Bayes Method for Crash Hotspot Identification Using Conditional Generative Adversarial Networks: A Simulated Crash Data Study

Transfer learning for spatio-temporal transferability of real-time crash prediction models

Examining imbalanced classification algorithms in predicting real-time traffic crash risk

Real-Time Crash Risk Prediction using Long Short-Term Memory Recurrent Neural Network

Identification of Significant Factors in Fatal-Injury Highway Crashes Using Genetic Algorithm and Neural Network

Data Augmentation Classifier for Imbalanced Fault Classification

Model-based generation of representative rear-end crash scenarios across the full severity range using pre-crash data

Real-time driving risk assessment using deep learning with XGBoost

Applications of machine learning methods in traffic crash severity modelling: current status and future directions

Short-Term Segment-Level Crash Risk Prediction Using Advanced Data Modeling with Proactive and Reactive Crash Data

Ensemble Data Augmentation for Imbalanced Fault Diagnosis

Crash Data Augmentation Using Conditional Generative Adversarial Networks (CGAN) for Improving Safety Performance Functions

A Crash Severity Prediction Method Based on Improved Neural Network and Factor Analysis

Traffic Accident Data Generation Based on Improved Generative Adversarial Networks