Feature Selection Approaches for Newborn Birthweight Prediction in Multiple Linear Regression Models

Esther Liu, Pei Xi Lin, Qianqi Wang, Karina Chen Feng
2024-11-18
Abstract: This project is based on the dataset "exposome_NA.RData", which contains a subcohort of 1301 mother-child pairs who were enrolled into the HELIX study during pregnancy. Several health outcomes were measured on the child at birth or at age 6-11 years, taking environmental exposures of interest and other covariates into account. This report outlines the process of obtaining the best MLR model with optimal predictive power. We first obtain three candidate models from forward selection, backward elimination and stepwise selection, and then select the optimal model using several comparison schemes, including AIC, adjusted R^2 and cross-validation with 8000 repetitions. The report ends with some additional findings revealed by the selected model, along with limitations of the method we use in the model selection process.
Numerical Analysis
What problem does this paper attempt to address?
The problem that this paper attempts to solve is predicting the birth weight of newborns with a multiple linear regression (MLR) model. Specifically, the goal of the study is to select the optimal set of features from a large number of environmental exposures and other covariates to construct an MLR model with the best predictive ability.

### Research Background and Problem Description

1. **Data Sources**
   - The data set is from "exposome_NA.RData", which contains a subcohort of 1,301 mother-infant pairs who participated in the HELIX study during pregnancy.
   - Multiple health outcomes of children at birth or at the age of 6-11 are recorded in the data set, and the environmental exposures of interest and other covariates are considered.
2. **Research Objectives**
   - To provide an optimal multiple linear regression model to predict the birth weight of newborns.
   - To evaluate and select the optimal model through methods such as cross-validation.

### Main Challenges

1. **Multicollinearity**
   - Multicollinearity refers to a high correlation among multiple independent variables, which makes it difficult to determine the actual impact of each independent variable on the dependent variable (birth weight).
   - Use the variance inflation factor (VIF) to detect and handle the multicollinearity problem, with the VIF threshold set at 10.
2. **Outliers and High-Leverage Points**
   - It is observed that the 985th observation point has an extremely high leverage value and influence (Cook's Distance), indicating that it is an outlier.
   - The researchers examined both the model including this outlier and the model after removing it to ensure the robustness of the conclusions.
3. **Model Selection Methods**
   - Use three different feature selection methods: forward selection, backward elimination, and stepwise selection.
   - Compare the performance of different models through indicators such as adjusted R², AIC (Akaike Information Criterion), and cross-validation.

### Solutions

1. **Feature Selection**
   - Eliminate variables with serious multicollinearity through VIF to reduce noise in the model.
   - Use the three methods of forward selection, backward elimination, and stepwise selection to construct candidate models respectively and compare them through multiple indicators.
2. **Model Evaluation**
   - Use indicators such as adjusted R², AIC, the sum of squared PRESS residuals, and the sum of squared DFFITS to evaluate model performance.
   - Conduct 8,000 repetitions of cross-validation to ensure the stability and generalization ability of the model.
3. **Final Model Selection**
   - Finally, the forward selection model is selected as the optimal model for the following reasons:
     - The forward selection model has fewer predictor variables (67), which is easy to interpret.
     - The observations in the forward selection model have lower leverage values, reducing the impact of outliers.
     - The data distribution of the forward selection model is more compact (the smallest IQR).

### Conclusion

Through the above methods, the researchers successfully constructed a multiple linear regression model that can effectively predict the birth weight of newborns and verified that this model satisfies the assumptions of linearity, normality, homoscedasticity, and independence.
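
The VIF screening and forward selection steps summarized above can be illustrated with a minimal Python sketch. It is not the authors' code. The synthetic data, the helper names (`vif`, `drop_high_vif`, `forward_select`), the one-at-a-time removal of the highest-VIF column, and the use of AIC as the entry criterion are all assumptions made for illustration. Only the VIF threshold of 10 comes from the summary.

```python
import numpy as np


def with_intercept(X):
    """Prepend a column of ones to the predictor matrix."""
    return np.column_stack([np.ones(X.shape[0]), X])


def rss(design, y):
    """Residual sum of squares of an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    ss_res = rss(with_intercept(others), target)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)


def drop_high_vif(X, names, threshold=10.0):
    """Assumed procedure: drop the worst column until every VIF <= threshold."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names


def aic(X, y, cols):
    """Gaussian AIC up to an additive constant: n * log(RSS / n) + 2k."""
    n = len(y)
    design = with_intercept(X[:, np.asarray(cols, dtype=int)])
    return n * np.log(rss(design, y) / n) + 2 * design.shape[1]


def forward_select(X, y):
    """Greedy forward selection: add the predictor that lowers AIC the most."""
    selected, remaining = [], list(range(X.shape[1]))
    best = aic(X, y, selected)  # intercept-only model
    while remaining:
        score, j = min((aic(X, y, selected + [j]), j) for j in remaining)
        if score >= best:  # no candidate improves AIC, so stop
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected, best


# Synthetic stand-in data (hypothetical, not the HELIX exposome data).
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = x0 + 0.01 * rng.normal(size=n)  # near copy of x0, so both have a large VIF
x3 = rng.normal(size=n)              # unrelated to the response
X = np.column_stack([x0, x1, x2, x3])
y = 3.0 * x0 - 2.0 * x1 + rng.normal(scale=0.5, size=n)

X_kept, kept_names = drop_high_vif(X, ["x0", "x1", "x2", "x3"])
selected, score = forward_select(X_kept, y)
selected_names = [kept_names[j] for j in selected]
print("kept after VIF screen:", kept_names)
print("forward selection picked:", selected_names)
```

In this toy setup `x2` is built as a near copy of `x0`, so the VIF screen should remove one of the pair before selection runs. Backward elimination and stepwise selection would follow the same pattern, starting from the full model or allowing removals at each step.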