Comparing the use of all data or specific subsets for training machine learning models in hydrology: A case study of evapotranspiration prediction
Haiyang Shi, Geping Luo, Olaf Hellwich, Xiufeng He, Mingjuan Xie, Wenqiang Zhang, Friday U. Ochege, Qing Ling, Yu Zhang, Ruixiang Gao, Alishir Kurban, Philippe De Maeyer, Tim Van de Voorde
DOI: https://doi.org/10.1016/j.jhydrol.2023.130399
IF: 6.4
2023-11-04
Journal of Hydrology
Abstract: Machine learning is widely used in hydrological modeling. However, whether to train a model on all available data or only on a specific subset, and what that choice implies, is rarely investigated explicitly. As a case study, we combined evapotranspiration (ET) observations from 168 flux stations with meteorological and biophysical variables and used Random Forests to construct an 'All data' model trained with all data and six 'plant functional type (PFT) specific' models, each trained with data from one PFT (i.e., Forest, Grassland, Cropland, Shrubland, Savannah, Wetland). We found that ET simulations are transferable between different PFTs. The 'All data' model captured ET better and had a higher R-squared at 94 of the 168 sites, especially in the Wetland, Shrubland, Cropland, and Grassland types. Compared with the 'All data' model, the 'PFT specific' model can further improve accuracy at grassland sites with high R-squared by reducing confusion from other PFTs and constraining the variance of the training data. When shifting from the 'All data' model to the 'PFT specific' model, an increase in the degree of encapsulation of the training set into the prediction set leads to a decrease in R-squared. Accuracy pre-evaluation may therefore be necessary before applying models trained on either all data or a subset.
geosciences, multidisciplinary; water resources; engineering, civil
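
The site-level comparison described in the abstract (counting the sites where the 'All data' model reaches a higher R-squared than the 'PFT specific' model) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes predictions from both models are already available, it assumes R-squared means the coefficient of determination (the abstract does not say which definition was used), and the function names, site labels, and numbers are hypothetical toy values.

```python
from collections import defaultdict


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    if ss_tot == 0:
        raise ValueError("observations are constant; R-squared is undefined")
    return 1.0 - ss_res / ss_tot


def compare_by_site(records):
    """Group (site, observed, pred_all, pred_pft) rows by site and
    return {site: (r2_all_data, r2_pft_specific)}."""
    by_site = defaultdict(lambda: ([], [], []))
    for site, obs, pred_all, pred_pft in records:
        obs_list, all_list, pft_list = by_site[site]
        obs_list.append(obs)
        all_list.append(pred_all)
        pft_list.append(pred_pft)
    return {
        site: (r_squared(o, a), r_squared(o, p))
        for site, (o, a, p) in by_site.items()
    }


def count_all_data_wins(scores):
    """Number of sites where the 'All data' model has the higher R-squared."""
    return sum(1 for r2_all, r2_pft in scores.values() if r2_all > r2_pft)


# Hypothetical toy rows: (site, observed ET, 'All data' prediction,
# 'PFT specific' prediction). These are not values from the study.
toy_records = [
    ("site_A", 1.0, 1.1, 1.4),
    ("site_A", 2.0, 2.1, 2.5),
    ("site_A", 3.0, 2.9, 3.6),
    ("site_B", 2.0, 2.6, 2.1),
    ("site_B", 4.0, 4.5, 3.9),
    ("site_B", 6.0, 5.2, 6.1),
]

scores = compare_by_site(toy_records)
wins = count_all_data_wins(scores)
print(f"'All data' model is better at {wins} of {len(scores)} sites")
```

Applied to real predictions, the same count would correspond to the "94 of 168 sites" figure reported in the abstract; the accuracy pre-evaluation the authors recommend amounts to inspecting such per-site scores before choosing which model to apply.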