Classification of Deceased Patients from Non-Deceased Patients using Random Forest and Support Vector Machine Classifiers

Dheeman Saha, Aaron Segura, Biraj Tiwari
2024-11-28
Abstract: Analyzing large datasets and summarizing them into useful information is the heart of the data mining process. In healthcare, information can be converted into knowledge about patient historical patterns and possible future trends. During the COVID-19 pandemic, mining COVID-19 patient data offers an opportunity to discover patterns that may signal that a patient is at high risk of death. COVID-19 patients die from sepsis, a complex disease process involving multiple organ systems. We extracted the variables physicians are most concerned about regarding viral septic infections. With the aim of distinguishing COVID-19 patients who survive their hospital stay from those who do not, the authors of this study use the Support Vector Machine (SVM) and Random Forest (RF) classification techniques to classify patients according to their demographics, laboratory test results, and preexisting health conditions. After conducting a 10-fold cross-validation procedure, we assessed the performance of the classifiers through a Receiver Operating Characteristic (ROC) curve, and a confusion matrix was used to determine their accuracy. We also performed a cluster analysis using as predictors the binary factors, such as whether the patient had a preexisting condition and whether sepsis was identified, together with the numeric values from patient demographics and laboratory test results.
Machine Learning
What problem does this paper attempt to address?
The problem that this paper attempts to solve is to distinguish patients who died of COVID-19 from those who survived by using machine-learning classifiers such as Random Forest and Support Vector Machine. Specifically, the research aims to:

1. **Identify key features**: Determine which patient characteristics (including demographic information, laboratory test results, and pre-existing health conditions) are associated with a higher risk of death.
2. **Build a prediction model**: Use these features to construct a classification model to predict the survival of COVID-19 patients.
3. **Evaluate model performance**: Evaluate the performance of the model through methods such as cross-validation, the ROC curve, and the confusion matrix.

### Problem background

During the COVID-19 pandemic, data-mining techniques can help discover patterns in patient data, thereby identifying patients who may be at high risk of death. Especially for sepsis caused by the virus, which is a complex multi-organ system disease process, early identification and intervention are crucial. Therefore, researchers hope to use machine-learning methods to extract useful information from a large amount of patient data to help medical workers better understand and predict the development of patients' conditions.

### Research objectives

- **Distinguish between surviving and deceased patients**: By analyzing the laboratory test results during the initial and last hospitalizations, past medical histories, and other demographic information of patients, develop a classification model that can effectively distinguish between surviving and deceased patients.
- **Improve clinical decision support**: Provide tools for hospitals and medical staff to more accurately assess and manage the conditions of COVID-19 patients, especially in the case of sepsis.
- **Guide future research**: By analyzing the results of the model, find out which factors have a significant impact on the survival rate of patients, thereby providing directions for future medical research.

### Method overview

Researchers used two main classification algorithms:

- **Support Vector Machine (SVM)**: Used to handle linearly non-separable data by finding the optimal hyperplane to separate samples of different classes.
- **Random Forest (RF)**: An ensemble learning method that classifies by constructing multiple decision trees and synthesizing their results.

In addition, pre-processing steps such as cluster analysis, principal component analysis (PCA), missing-value handling, and outlier detection were also carried out to ensure the quality of the data and the effectiveness of the model.

### Formula summary

- **Gini index**: Used to measure the purity of a node, and the formula is as follows:
  \[ Gini(p) = 1 - \sum_{i=1}^{c} p_{i}^{2} \]
  where \(p_{i}\) is the proportion of samples of the \(i\)-th class.
- **Gini index after splitting**: When the data set \(D\) is split into two subsets \(D_{1}\) and \(D_{2}\) on the attribute \(a\), the new Gini index is:
  \[ Gini(D, a) = \frac{|D_{1}|}{|D|} Gini(D_{1}) + \frac{|D_{2}|}{|D|} Gini(D_{2}) \]
- **LOF (Local Outlier Factor)**: Used to detect outliers, and the formula is as follows:
  \[ LOF(p) = \frac{\sum_{o \in N_{k}(p)} \frac{lrd(o)}{lrd(p)}}{|N_{k}(p)|} \]
  where \(lrd(p)\) is the local reachability density of point \(p\), and \(N_{k}(p)\) is the set of \(k\)-nearest neighbors of \(p\).

Through these methods, researchers hope to be able to construct a reliable prediction model to help the medical system better cope with the challenges brought by COVID-19 and its complications.
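As a rough illustration of the two Gini formulas in the formula summary, the following minimal sketch computes the impurity of a node and the weighted impurity after a binary split. The function names `gini` and `gini_split` and the toy labels are hypothetical, chosen only for this example; they are not taken from the paper.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum_i p_i^2 over the class proportions p_i."""
    n = len(labels)
    if n == 0:
        # Assumption for this sketch: an empty node is treated as pure.
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Weighted Gini index after splitting a data set D into D1 (left) and D2 (right)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy labels (hypothetical): "d" = deceased, "s" = survived.
mixed_node = ["d", "s", "d", "s"]
print(gini(mixed_node))                        # impurity of a 50/50 node
print(gini_split(["d", "d"], ["s", "s"]))      # a split that separates the classes
print(gini_split(["d", "s"], ["d", "s"]))      # a split that separates nothing
```

A decision tree would typically evaluate candidate splits this way and prefer the one with the lowest weighted Gini index.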
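The LOF formula in the formula summary can likewise be sketched in plain Python. The summary does not define \(lrd\), so this sketch assumes the standard definition: the inverse of the mean reachability distance from \(p\) to its neighbors, where the reachability distance to \(o\) is the larger of \(o\)'s \(k\)-distance and the actual distance. It also assumes Euclidean distance and no duplicate points. The function name `lof_scores` and the sample points are hypothetical.

```python
import math


def lof_scores(points, k):
    """Local Outlier Factor for each point (sketch; assumes no duplicate points)."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

    k_dist = []     # distance from each point to its k-th nearest neighbor
    neighbors = []  # N_k(p): every other point within the k-distance (ties included)
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        kd = others[k - 1]
        k_dist.append(kd)
        neighbors.append([j for j in range(n) if j != i and dist[i][j] <= kd])

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_dist[o], dist[i][o]) for o in neighbors[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    # LOF(p) = mean over o in N_k(p) of lrd(o) / lrd(p)
    return [
        sum(lrds[o] / lrds[i] for o in neighbors[i]) / len(neighbors[i])
        for i in range(n)
    ]


# Four tightly grouped points and one distant point (hypothetical data).
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(sample, k=2)
for point, score in zip(sample, scores):
    print(point, round(score, 3))
```

Scores near 1 indicate a point whose local density matches that of its neighbors, while scores well above 1 flag candidates for outliers. The cutoff used to discard records is a separate choice that this sketch does not make.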