Strategies for Imputing Missing Values and Removing Outliers in the Dataset for Machine Learning-Based Construction Cost Prediction

Haneul Lee,Seokheon Yun
DOI: https://doi.org/10.3390/buildings14040933
IF: 3.324
2024-03-29
Buildings
Abstract:Accurately predicting construction costs during the initial planning stages is crucial for the successful completion of construction projects. Recent advancements have introduced various machine learning-based methods to enhance cost estimation precision. However, the accumulation of authentic construction cost data is not straightforward, and existing datasets frequently exhibit a notable presence of missing values, posing challenges to precise cost predictions. This study aims to analyze diverse substitution methods for addressing missing values in construction cost data. Additionally, it seeks to evaluate the performance of machine learning models in cost prediction through the removal of conditional outliers. The primary goal is to identify and propose optimal strategies for handling missing value in construction cost records, ultimately improving the reliability of cost predictions. According to the analysis results, among single imputation methods, median imputation emerges as the most suitable, while among multiple imputation methods, lasso regression imputation produces the most superior outcomes. This research contributes to enhancing the trustworthiness of construction cost predictions by presenting a pragmatic approach to managing missing data in construction cost performance records, thereby facilitating more precise project planning and execution.
construction & building technology,engineering, civil
What problem does this paper attempt to address?