Prognostic Models for Sepsis Built on Small Datasets
Chunyan Li, Lu Wang, Kexun Li, Hongfei Deng, Yu Wang, Li Chang, Ping Zhou, Jun Zeng, Mingwei Sun, Hua Jiang, Qi Wang
DOI: https://doi.org/10.2139/ssrn.4317582
2023-01-01
Abstract: Background and Objectives: Sepsis is a common cause of death in intensive care units. A reliable prognostic model based on relevant patient indicators would help clinicians make treatment decisions that improve clinical outcomes for septic patients. This study aims to develop a machine-learning framework for building such prognostic tools from the class-imbalanced longitudinal data of a small group of septic patients.
Methods: An engineered input dataset is devised in the form of concatenated triplets to increase the data size relative to the dimension of the feature space. Each concatenated triplet consists of a patient’s static data, longitudinal data from k consecutive days (k = 2, 3, 4, 5), and the clinical outcome. The structured input data are then used to train classifiers in combination with appropriate feature engineering techniques. The trained classifiers are validated on a new, external dataset to assess their clinical efficacy. We implement the modeling approach using five classifiers: K-nearest neighbors, logistic regression, support vector machine, random forest (RF), and extreme gradient boosting (XGBoost), coupled with a set of feature engineering techniques. AUROC and a new metric, γ, based on the F1 score on the external validation set, are used to assess the efficacy of the models.
Results: Five prognostic models are built on the engineered input dataset using 10 selected dynamic features. Of these, XGBoost (AUROC = 0.777, F1 score = 0.694) and RF (AUROC = 0.769, F1 score = 0.647), each combined with the ensemble under-sampling strategy, outperform the other models in external validation. This study shows that the developed framework can improve the accuracy and generalizability of standard classifiers: relative to the same two models without feature engineering, the improvements in (AUROC, overfitting) are (6.66%, 54.96%) for RF and (0.52%, 77.72%) for XGBoost.
Conclusion: A new modeling framework is devised for developing mortality prognostic tools for septic patients from small, class-imbalanced, high-dimensional datasets. It enables standard classifiers to achieve relatively high predictive performance on small datasets by engineering new structured datasets encoded with temporal features and by applying sampling strategies and dimension-reduction techniques. The framework provides clinically useful prognostic models and sets an example for applying machine learning methods to small data problems in medicine.
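The concatenated-triplet construction described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the k-day windows are chosen, so the overlapping sliding window used here is an assumption. The function name `build_triplets`, its parameters, and the example patient values are all hypothetical.

```python
from typing import List, Sequence, Tuple

# One triplet: (static features, flattened k-day dynamic features, outcome).
Triplet = Tuple[List[float], List[float], int]


def build_triplets(
    static: Sequence[float],
    daily: Sequence[Sequence[float]],
    outcome: int,
    k: int,
) -> List[Triplet]:
    """Return one triplet per window of k consecutive days for one patient.

    Assumption (not stated in the abstract): windows overlap and advance
    one day at a time, so a stay of n days yields n - k + 1 triplets.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    triplets: List[Triplet] = []
    # Slide a window of k consecutive days over the patient's record.
    for start in range(len(daily) - k + 1):
        # Concatenate the k daily feature vectors into one flat vector.
        window = [value for day in daily[start:start + k] for value in day]
        triplets.append((list(static), window, outcome))
    return triplets


# Hypothetical patient: 2 static features, 4 days of 2 dynamic features,
# outcome 1. With k = 3 the record yields 4 - 3 + 1 = 2 triplets.
example = build_triplets(
    static=[63.0, 1.0],
    daily=[[1.2, 88.0], [1.5, 92.0], [2.1, 97.0], [2.4, 101.0]],
    outcome=1,
    k=3,
)
print(len(example))  # 2
```

Under this assumption a single patient contributes several training rows, which is how the sample count grows relative to the feature dimension. Because rows from one patient are highly correlated, any train/validation split in such a scheme would need to be made at the patient level rather than the row level.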