Density-Aware Personalized Training for Risk Prediction in Imbalanced Medical Data

Zepeng Huo, Xiaoning Qian, Shuai Huang, Zhangyang Wang, Bobak J. Mortazavi
DOI: https://doi.org/10.48550/arXiv.2207.11382
2022-07-30
Abstract: Medical events of interest, such as mortality, often happen at a low rate in electronic medical records, as most admitted patients survive. Training models with this imbalance rate (class density discrepancy) may lead to suboptimal prediction. Traditionally this problem is addressed through ad-hoc methods such as resampling or reweighting but performance in many cases is still limited. We propose a framework for training models for this imbalance issue: 1) we first decouple the feature extraction and classification process, adjusting training batches separately for each component to mitigate bias caused by class density discrepancy; 2) we train the network with both a density-aware loss and a learnable cost matrix for misclassifications. We demonstrate our model's improved performance in real-world medical datasets (TOPCAT and MIMIC-III) to show improved AUC-ROC, AUC-PRC, and Brier Skill Score compared with the baselines in the domain.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the challenges of risk prediction on imbalanced medical data. It focuses on three issues:

1. **Class imbalance**:
   - Medical events such as mortality occur at a low rate, so electronic medical records are highly imbalanced. In mortality prediction, for example, high-risk patients are far outnumbered by patients who survive.
   - Training on such imbalanced data can bias the model, which degrades prediction performance.
2. **Limitations of existing methods**:
   - Traditional remedies such as resampling or reweighting only partially alleviate the imbalance, and their effectiveness is limited in practice. They are usually heuristic and lack standardization and automation.
   - These methods may not fully exploit the information in imbalanced data and may even lead to overfitting or underfitting.
3. **Model calibration**:
   - Common evaluation metrics such as AUC-ROC tend to produce overly optimistic results on imbalanced data and ignore calibration. Beyond discrimination performance, it is therefore necessary to assess whether the model's predicted probabilities are reliable.

To address these issues, the authors propose a framework called **Density-Aware Personalized Training (DAPT)**, which has two main components:

1. **Decoupling feature extraction and classification**:
   - The feature extractor and the classifier are trained separately, with the training batches adjusted for each component, to reduce the bias caused by class density differences.
2. **A density-aware loss and a learnable cost matrix**:
   - The density-aware loss accounts for the difference in data density between the majority and minority classes, and the learnable cost matrix personalizes the cost of misclassification. This helps the model distinguish subtle differences between risk groups and improves prediction performance.

With these changes, the authors report improved performance on real-world medical datasets (TOPCAT and MIMIC-III), notably on AUC-PRC and Brier Skill Score.
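
To make two of the ideas above concrete, the following is a minimal Python sketch, not the authors' implementation. This summary does not give the paper's batch schedule, loss, or cost matrix, so `class_balanced_batch` is only a generic class-balanced sampler (an assumed illustration of what "adjusting training batches" could look like), and `brier_skill_score` uses the standard definition, with the observed event rate as the reference forecast. Both function names are hypothetical.

```python
import numpy as np


def class_balanced_batch(labels, batch_size, rng):
    """Draw batch indices so that every class is equally likely per slot.

    Assumption: a generic class-balanced sampler, used here only to
    illustrate batch adjustment; it is not the paper's exact scheme.
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Pick a class uniformly for each slot, then a random example of that class.
    chosen = rng.choice(classes, size=batch_size)
    return np.array([rng.choice(np.flatnonzero(labels == c)) for c in chosen])


def brier_skill_score(y_true, y_prob):
    """Brier Skill Score: 1 - BS / BS_ref.

    BS_ref is the Brier score of always predicting the observed event rate.
    Assumes both classes are present, otherwise BS_ref is zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    brier = np.mean((y_prob - y_true) ** 2)
    base_rate = y_true.mean()
    brier_ref = np.mean((base_rate - y_true) ** 2)
    return 1.0 - brier / brier_ref
```

A score of 0 means the predicted probabilities are no better than always predicting the event rate, and a score of 1 means perfect probabilistic predictions. A model can rank patients well, and so score a high AUC-ROC, while still scoring poorly here, which is why the summary treats calibration as a separate concern.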