Soil data recency: The foundation for harmonizing soil data across time

Tegbaru B Gobezie, Stacey D Scott, Prasad Daggupati, Angela Bedard-Haughn, Asim Biswas
DOI: https://doi.org/10.1016/j.jenvman.2024.121484
Abstract: Sustainable soil resource management depends on reliable soil information, often derived from 'legacy soil data' or a combination of old and new soil data. However, the task of harmonizing soil data collected at different times remains largely unexplored in the literature. Addressing this challenge requires incorporating the temporal dimension into mathematical and statistical models for spatio-temporal soil studies. This study aimed to create a comprehensive framework for harmonizing soil data across time. We assessed the integration of historical and recent soil data, ranging from 4 to 48 years in age, using soil data recency analysis. To achieve this, we introduced an 'age of data' attribute, calculated as the time difference between the soil survey year and the present (e.g., 2022). We applied three machine learning models, Decision Trees (DT), Random Forest (RF), and Gradient Boosting (GBM), to a dataset containing 6339 sites and 28,149 depth-harmonized layers. All three models performed robustly, with RF performing best at an R-squared value of 0.99, an RMSE of 1.41, and a concordance of 0.97. DT and GBM also showed strong predictive power. Terrain-derived environmental covariates played a more important role than land use and land cover (LULC) change in predicting soil data recency. While soil organic carbon concentration varied with LULC change across the different depths, LULC change was a less important factor. Anthropogenic factors, such as LULC change and normalized difference vegetation index (NDVI), were not primary determinants of soil data recency. Variations in soil depth had no impact on predicting soil data recency. This study validated that terrain-derived covariates, especially elevation factors, effectively explain the quality of older soil data when predicting current soil attributes using the soil data recency concept.
This approach has the potential to enhance real-time estimates, such as carbon budgets, and we emphasize its importance in global earth system models.
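The 'age of data' attribute described in the abstract can be illustrated with a minimal Python sketch. The function name `age_of_data`, the constant `REFERENCE_YEAR`, and the sample survey years are hypothetical and chosen only for illustration; the abstract specifies only that the attribute is the difference between the survey year and the present (e.g., 2022).

```python
# Reference "present" year; 2022 is the example given in the abstract.
REFERENCE_YEAR = 2022


def age_of_data(survey_year: int, reference_year: int = REFERENCE_YEAR) -> int:
    """Return the number of years between a soil survey and the reference year."""
    if survey_year > reference_year:
        raise ValueError("survey year is later than the reference year")
    return reference_year - survey_year


# Hypothetical survey years, picked to span the 4 to 48 year range in the abstract.
survey_years = [1974, 1990, 2005, 2018]
ages = [age_of_data(year) for year in survey_years]
print(ages)  # [48, 32, 17, 4]
```

In a modelling workflow of the kind the abstract describes, a value like this would be attached to each site or depth layer and treated as the quantity to be predicted from environmental covariates.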