Improved Data Splitting Methods for Data-Driven Hydrological Model Development Based on a Large Number of Catchment Samples

Junyi Chen, Feifei Zheng, Robert May, Danlu Guo, Hoshin Gupta, Holger R. Maier
DOI: https://doi.org/10.1016/j.jhydrol.2022.128340
IF: 6.4
2022-01-01
Journal of Hydrology
Abstract: Data-driven hydrological models are widely used for many practical purposes. However, the reliability of such models depends heavily on the strategy used to partition available observations into model calibration and evaluation subsets. Unfortunately, available data splitting methods are poor at ensuring consistency of statistical properties between different subsets, resulting in considerable bias and/or inconsistency in evaluation performance as well as poor generalization ability. To address this problem, we propose and test two new data splitting methods applied to hydrological models that do not consider time-dependent structure. The SOMPLEX approach uses a self-organizing map to analytically cluster the data based on its distributional properties, after which a portion of each cluster is allocated to the calibration and evaluation subsets using the previously developed DUPLEX method. In the MDUPLEX approach, rather than clustering the data, the DUPLEX allocation strategy is modified to better maintain statistical similarity of the data subsets. When tested using a data-driven rainfall-runoff modelling study applied to 754 catchments, the new methods were significantly better at splitting the data into subsets with similar statistical properties. However, the performance of the different methods was found to depend strongly on the skewness of the streamflow data, based on which we present practical recommendations regarding which method to use in different circumstances. Since the task of partitioning the data into mutually statistically consistent subsets is generic, these concepts are broadly applicable to many hydrological models where time-dependency of hydro-climatic data can be ignored, ranging from physics-based to data-driven, including machine learning.
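For readers unfamiliar with the DUPLEX idea that both proposed methods build on, the following is a minimal sketch of a DUPLEX-style allocation into two subsets. It is not the authors' implementation, and it covers neither the SOM clustering step of SOMPLEX nor the modification used in MDUPLEX. It assumes Euclidean distance, a max-min selection rule, and an even 50/50 alternation between the calibration and evaluation subsets; the function name `duplex_style_split` is hypothetical.

```python
import numpy as np


def duplex_style_split(X):
    """Alternately assign samples to two subsets using a max-min distance rule.

    Simplified illustration only (assumed details: Euclidean distance,
    equal-sized subsets). Returns two arrays of row indices into X.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D series as a single feature
    n = X.shape[0]
    if n < 4:
        raise ValueError("need at least 4 samples to seed two subsets")

    # Pairwise Euclidean distances between all samples.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    remaining = list(range(n))
    subsets = ([], [])  # (calibration, evaluation)

    # Seed each subset with the farthest-apart pair among the remaining samples.
    for subset in subsets:
        sub = D[np.ix_(remaining, remaining)]
        np.fill_diagonal(sub, -np.inf)  # never pair a sample with itself
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        pair = [remaining[i], remaining[j]]
        subset.extend(pair)
        remaining = [k for k in remaining if k not in pair]

    # Alternate between subsets, each time adding the remaining sample whose
    # nearest neighbour inside that subset is farthest away.
    turn = 0
    while remaining:
        subset = subsets[turn % 2]
        nearest = D[np.ix_(remaining, subset)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        subset.append(pick)
        remaining.remove(pick)
        turn += 1

    return np.array(subsets[0]), np.array(subsets[1])
```

Because samples are taken in turn from the extremes inward, both subsets tend to span a similar range of the data, which is the property the abstract refers to as consistency of statistical properties. A SOMPLEX-like workflow would, under the same assumptions, apply such an allocation separately within each cluster.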