Performance Evaluation of Regression Models in Predicting the Cost of Medical Insurance

Jonelle Angelo S. Cenita, Paul Richie F. Asuncion, Jayson M. Victoriano

DOI: https://doi.org/10.25147/ijcsr.2017.001.1.146

2023-04-25

Abstract: The study aimed to evaluate the performance of regression models in predicting the cost of medical insurance. Three (3) machine learning regression models were used: Linear Regression, Gradient Boosting, and Support Vector Machine. Performance was evaluated using RMSE (Root Mean Square Error), r2 (R Square), and K-Fold Cross-validation. The study also sought to pinpoint the feature most important in predicting the cost of medical insurance. The study is anchored on the knowledge discovery in databases (KDD) process, which refers to the overall process of discovering useful knowledge from data. The performance evaluation results reveal that, among the three (3) regression models, Gradient Boosting received the highest r2 (R Square), 0.892, and the lowest RMSE, 1336.594. Furthermore, the 10-Fold Cross-validation weighted mean findings are not significantly different from the r2 (R Square) results of the three (3) regression models. In addition, Exploratory Data Analysis (EDA) using box plots of descriptive statistics showed that, for charges grouped by the smoker feature, the median of one group lies outside the box of the other group, so there is a difference between the two groups. The study concludes that Gradient Boosting appears to perform best among the three (3) regression models, and K-Fold Cross-Validation indicates that all three (3) regression models perform well. Moreover, the EDA box plots suggest that the highest charges are associated with the smoker feature.
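The box plot reading described in the abstract (the median of one group falling outside the box of the other) can be sketched as a small check. This is only an illustration: the function name `median_outside_box` and the charge values are hypothetical and are not taken from the paper or its dataset.

```python
import numpy as np


def median_outside_box(group_a, group_b):
    """Return True if the median of group_a lies outside the
    interquartile box (Q1 to Q3) of group_b."""
    q1, q3 = np.percentile(group_b, [25, 75])
    median_a = np.median(group_a)
    return bool(median_a < q1 or median_a > q3)


# Made-up charges for illustration only (not the paper's data).
smoker_charges = [30000.0, 32000.0, 35000.0, 40000.0]
nonsmoker_charges = [4000.0, 6000.0, 8000.0, 12000.0]

# The two groups are read as different if either median falls
# outside the other group's box.
groups_differ = (
    median_outside_box(smoker_charges, nonsmoker_charges)
    or median_outside_box(nonsmoker_charges, smoker_charges)
)
```

This is a visual rule of thumb for comparing two box plots, not a formal significance test.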

Machine Learning

What problem does this paper attempt to address?

The main objective of this paper is to evaluate the performance of three machine learning regression models (Linear Regression, Gradient Boosting, and Support Vector Machine) in predicting healthcare insurance costs and to determine which features are most important for predicting these costs. The research team used a dataset named "Medical Cost Personal Datasets" from Kaggle for analysis. By comparing the models' Root Mean Square Error (RMSE) and Coefficient of Determination (\(R^2\) value), they aimed to identify the most accurate predictive model. Additionally, the study conducted 10-fold cross-validation to further assess the models' generalization capabilities and used Exploratory Data Analysis (EDA) to identify key factors influencing medical costs.

The main findings are as follows:

- The Gradient Boosting model performed the best in predicting medical costs, with the highest \(R^2\) value (0.892) and the lowest RMSE (1336.594).
- The results of the 10-fold cross-validation were similar to the \(R^2\) values, indicating that all three models performed well, but the Gradient Boosting model remained the best performer.
- Exploratory Data Analysis revealed that the "smoker" feature is crucial for predicting high medical costs.

Therefore, the paper recommends using the Gradient Boosting model for predicting medical costs, which can help insurance companies better formulate policies and manage resource allocation.
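The comparison summarized above (three regressors scored by RMSE, \(R^2\), and 10-fold cross-validation) might look roughly like the following scikit-learn sketch. It is not the authors' code. The synthetic data from `make_regression` stands in for the Kaggle dataset, the helper name `evaluate_models`, the 80/20 split, and the default hyperparameters are all assumptions, and the cross-validation score is a plain mean rather than the weighted mean the paper reports.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.svm import SVR


def evaluate_models(X, y, seed=0):
    """Fit three regressors and report hold-out RMSE, hold-out R^2,
    and the mean R^2 over 10-fold cross-validation for each."""
    # Assumed 80/20 hold-out split; the paper's split is not stated here.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "linear_regression": LinearRegression(),
        "gradient_boosting": GradientBoostingRegressor(random_state=seed),
        "support_vector_machine": SVR(),  # default hyperparameters (assumption)
    }
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
            "r2": float(r2_score(y_test, pred)),
            # cross_val_score clones the model, so the fit above is untouched.
            "cv_r2_mean": float(
                cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
            ),
        }
    return results


# Synthetic stand-in data (assumption), not the insurance dataset.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
results = evaluate_models(X, y)
```

With the real dataset, categorical columns such as the smoker flag would need to be encoded as numbers first, and the scores would differ from those on this synthetic data.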

Machine Learning For An Explainable Cost Prediction of Medical Insurance

Medical Insurance Cost Prediction using Machine Learning

A Machine Learning-Based Risk Assessment System Prediction Algorithm for Examining Medical Insurance Costs

Medical Insurance Cost Analysis and Prediction using Machine Learning

Medical Insurance Cost Prediction

A Computational Intelligence Approach for Predicting Medical Insurance Cost

MEDICAL INSURANCE PREMIUM PREDICTION WITH MACHINE LEARNING

Comparison and Analysis of the Effectiveness of Linear Regression, Decision Tree, and Random Forest Models for Health Insurance Premium Forecasting

Health Insurance Cost Prediction Using Regression Machine Learning Models

Machine learning based methods for ratemaking health care insurance

Machine learning versus regression modelling in predicting individual healthcare costs from a representative sample of the nationwide claims database in France

Use of responsible artificial intelligence to predict health insurance claims in the USA using machine learning algorithms

Machine Learning-Based Prediction for High Health Care Utilizers by Using a Multi-Institutional Diabetes Registry: Model Training and Evaluation

A Study on a car Insurance purchase Prediction Using Two-Class Logistic Regression and Two-Class Boosted Decision Tree

Prediction of pharmaceutical and non-pharmaceutical expenditures associated with Diabetes Mellitus type II based on clinical risk

Risk prediction in life insurance industry using supervised learning algorithms

Using machine learning approaches to predict high-cost chronic obstructive pulmonary disease patients in China

Exploring the use of machine learning for risk adjustment: A comparison of standard and penalized linear regression models in predicting health care costs in older adults

Building predictive models of healthcare costs with open healthcare data

Examining different cost ratio frameworks for decision rule machine learning algorithms in diagnostic application