Abstract: Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on "Big Data and Analytics in Healthcare". Background: Hospital readmissions raise healthcare costs and cause significant distress to providers and patients. It is, therefore, of great interest to healthcare organizations to predict which patients are at risk of being readmitted to their hospitals. However, current logistic-regression-based risk prediction models have limited predictive power when applied to hospital administrative data. Meanwhile, although decision trees and random forests have been applied, they tend to be too complex for hospital practitioners to understand. Objectives: To explore the use of conditional logistic regression to increase prediction accuracy. Methods: We analyzed an HCUP statewide inpatient discharge record dataset, which includes patient demographics and clinical and care utilization data from California. We extracted records of Medicare beneficiaries with heart failure who had an inpatient stay during an 11-month period. We corrected the data imbalance with under-sampling. We first applied standard logistic regression and a decision tree to identify influential variables and derive practically meaningful decision rules. We then stratified the original data set accordingly and applied logistic regression to each data stratum. We further explored the effect of interacting variables in the logistic regression modeling. We conducted cross-validation to assess the overall prediction performance of conditional logistic regression (CLR) and compared it with standard classification models. Results: The developed CLR models outperformed several standard classification models (e.g., straightforward logistic regression, stepwise logistic regression, random forest, support vector machine). For example, the best CLR model improved the classification accuracy by nearly 20% over the straightforward logistic regression model.
Furthermore, the developed CLR models tend to achieve sensitivity more than 10% better than the standard classification models, which translates to correctly labeling an additional 400-500 readmissions of heart failure patients in the state of California over a year. In addition, key predictors identified from the HCUP data include the discharge disposition location, the number of chronic conditions, and the number of acute procedures. Conclusions: It would be beneficial to apply simple decision rules obtained from the decision tree in an ad-hoc manner to guide the cohort stratification. It could also be beneficial to explore the effect of pairwise interactions between influential predictors when building the logistic regression models for different data strata. Judicious use of the ad-hoc CLR models developed here offers insights into the future development of prediction models for hospital readmissions, which can lead to better intuition in identifying high-risk patients and developing effective post-discharge care strategies. Lastly, this paper is expected to raise awareness of the need to collect data on additional markers and to develop the necessary database infrastructure for larger-scale exploratory studies on readmission risk prediction.
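
The stratify-then-fit idea described in the Methods can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the field names (`discharged_home`, `n_chronic`, `n_procedures`, `readmitted`) are hypothetical, the stratification rule is hand-written rather than derived from a decision tree, the data are synthetic, and the logistic regression is a plain gradient-descent fit written with the Python standard library only.

```python
import math
import random

FEATURES = ["n_chronic", "n_procedures"]  # hypothetical predictor names

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def make_synthetic(n=400, seed=1):
    """Synthetic records (assumption): risk depends on a different variable per stratum."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        home = rng.randint(0, 1)
        chronic = rng.randint(0, 10)
        procs = rng.randint(0, 5)
        z = (0.6 * chronic - 4.0) if home else (0.9 * procs - 3.0)
        rows.append({"discharged_home": home, "n_chronic": chronic,
                     "n_procedures": procs,
                     "readmitted": 1 if rng.random() < sigmoid(z) else 0})
    return rows

def undersample(rows, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r["readmitted"] == 1]
    neg = [r for r in rows if r["readmitted"] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

def fit_logistic(X, y, lr=0.05, epochs=500):
    """Batch gradient descent on the logistic loss; returns [bias, w1, w2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def stratum_of(row):
    # Hand-written rule standing in for a decision-tree-derived rule (assumption).
    return "home" if row["discharged_home"] == 1 else "facility"

def fit_stratified(rows):
    """Fit one logistic regression per stratum."""
    models = {}
    for name in {stratum_of(r) for r in rows}:
        part = [r for r in rows if stratum_of(r) == name]
        X = [[r[f] for f in FEATURES] for r in part]
        y = [r["readmitted"] for r in part]
        models[name] = fit_logistic(X, y)
    return models

def predict(models, row):
    """Route the row to its stratum's model and return the readmission probability."""
    w = models[stratum_of(row)]
    return sigmoid(w[0] + sum(a * row[f] for a, f in zip(w[1:], FEATURES)))

rows = make_synthetic()
balanced = undersample(rows)
models = fit_stratified(balanced)
```

Each new record is scored only by the model of the stratum it falls into, so each coefficient vector stays as readable as in an ordinary logistic regression while the effect of a predictor is allowed to differ between strata.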

Identification of high-risk beneficiaries in private healthcare insurance

Using massive health insurance claims data to predict very high-cost claimants: a machine learning approach

Medical Insurance Cost Analysis and Prediction using Machine Learning

A Machine Learning-Based Risk Assessment System Prediction Algorithm for Examining Medical Insurance Costs

Towards Better Detection of Fraud in Health Insurance Claims in Kenya: Use of Naïve Bayes Classification Algorithm

Machine Learning-Based Prediction for High Health Care Utilizers by Using a Multi-Institutional Diabetes Registry: Model Training and Evaluation

Identifying Diabetic Patients with High Risk of Readmission

A Framework for Predicting Impactability of Healthcare Interventions Using Machine Learning Methods, Administrative Claims, Sociodemographic and App Generated Data

Use of responsible artificial intelligence to predict health insurance claims in the USA using machine learning algorithms

Personalized Stratification of Hospitalization Risk Amidst COVID-19: A Machine Learning Approach

Simplified Machine Learning Models Can Accurately Identify High-Need High-Cost Patients With Inflammatory Bowel Disease

Not there yet: using data-driven methods to predict who becomes costly among low-cost patients with type 2 diabetes

Predicting 30-day Hospital Readmission with Publicly Available Administrative Database. A Conditional Logistic Regression Modeling Approach

Building prediction models and discovering important factors of health insurance fraud using machine learning methods

Predicting high health-cost users among people with cardiovascular disease using machine learning and nationwide linked social administrative datasets

Prediction of pharmaceutical and non-pharmaceutical expenditures associated with Diabetes Mellitus type II based on clinical risk

Advances in Prediction of Readmission Rates Using Long Term Short Term Memory Networks on Healthcare Insurance Data

Early prediction of high-cost inpatients with ischemic heart disease using network analytics and machine learning

Design and development of big data-based model for detecting fraud in healthcare insurance industry

A Novel Machine Learning Algorithm for Creating Risk-Adjusted Payment Formulas

In-hospital mortality, readmission, and prolonged length of stay risk prediction leveraging historical electronic patient records