A random forests-based hedonic price model accounting for spatial autocorrelation

Emre Tepe
DOI: https://doi.org/10.1007/s10109-024-00449-w
2024-11-08
Journal of Geographical Systems
Abstract: This paper introduces a spatially explicit random forests-based hedonic price modeling approach to account for spatial autocorrelation in the data. Spatial autocorrelation is common in georeferenced data, and controlling for associations among spatial objects is crucial for accurate statistical analysis. Validating machine learning and artificial intelligence applications requires out-of-sample data sets to assess how well models fitted on the training data set generalize. Previous research has shown that nonspatial cross-validation methods, commonly used in machine learning applications for spatial data, often provide over-optimistic results. Some studies have recommended spatial cross-validation methods to obtain more reliable estimates. However, the machine learning models used in these previous studies did not include spatially explicit parameters to account for spatial autocorrelation in the data. Unlike machine learning-based models, statistical models such as the spatial lag model can effectively account for spatial autocorrelation in the data. This research applies a two-stage least squares random forests framework to construct a hedonic pricing model incorporating a spatial lag for the Miami-Dade single-family residential parcel sales data. The random forests models are evaluated using K-fold, spatial blocking K-fold, and spatial leave-one-out cross-validation methods. Their goodness-of-fit is measured with the coefficient of determination and the mean squared error. Additionally, spatial autocorrelation in the residuals from the random forests models is investigated with Moran's I test. Our research indicates that failing to account for spatial autocorrelation in the data can lead to unreliable and overly optimistic estimates. However, including a spatially lagged variable substantially reduces fluctuations in goodness-of-fit measures across different validation sets.
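To make two of the quantities named in the abstract concrete, the following is a minimal sketch of a spatially lagged variable and of the global Moran's I statistic. It is not the paper's implementation: the function names (`rook_weights`, `spatial_lag`, `morans_i`), the rook-contiguity weights, and the 4x4 toy grid are all assumptions chosen for illustration, and the two-stage least squares step and the random forests model are not reproduced here.

```python
def rook_weights(rows, cols):
    """Binary rook-contiguity weights for a rows x cols grid (hypothetical toy layout)."""
    n = rows * cols
    w = [[0.0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    w[i][rr * cols + cc] = 1.0
    return w


def spatial_lag(values, weights):
    """Row-standardized spatial lag: the mean of each cell's neighbors' values."""
    lag = []
    for row in weights:
        total = sum(row)
        lag.append(sum(wij * v for wij, v in zip(row, values)) / total if total else 0.0)
    return lag


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(zi * zi for zi in z)
    return (n / s0) * (num / den)


w = rook_weights(4, 4)

# Clustered pattern: left two columns are 0, right two columns are 1.
clustered = [0.0, 0.0, 1.0, 1.0] * 4
# Checkerboard pattern: every rook neighbor has the opposite value.
checker = [float((r + c) % 2) for r in range(4) for c in range(4)]

i_clustered = morans_i(clustered, w)  # positive: similar values sit next to each other
i_checker = morans_i(checker, w)      # negative: neighbors are dissimilar
lag_checker = spatial_lag(checker, w)
```

In a residual diagnostic of the kind the abstract describes, `values` would be model residuals rather than raw attribute values; a Moran's I near zero would then suggest little remaining spatial autocorrelation.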
geography