Predicting high-frequency nutrient dynamics in the Danube River with surrogate models using sensors and Random Forest

Yen Binh Tran, Leonardo F. Arias-Rodriguez, Jingshui Huang
DOI: https://doi.org/10.3389/frwa.2022.894548
2022-08-17
Frontiers in Water
Abstract: Nutrient dynamics play an essential role in aquatic ecosystems. Despite advances in sensor technology, nutrient concentrations are difficult and expensive to monitor in situ and in real time. Emerging data-driven methods may provide surrogate measures for nutrient concentrations. In this work, we use 4 years of high-frequency (15-min interval) water quality data acquired at 2 automatic stations in the German Danube River to train data-driven algorithms and build surrogate measures for nitrate (NO₃⁻-N), ammonium (NH₄⁺-N), and orthophosphate (PO₄³⁻-P). Pre-processing of the data included removing outliers and filling missing values by linear interpolation. Multiple Linear Regression (MLR) and Random Forest (RF) models are trained, cross-validated, and tested using dissolved oxygen (DO), temperature (Temp), conductivity (EC), pH, discharge rate (Q), and chlorophyll-a (Chl-a) as input features. Additionally, we derived cyclical features from the time-series data to test whether they improve how well the models capture the underlying relationships in the data. This work presents a thorough description of the modeling workflow, including intermediate steps for feature engineering, feature selection, and hyperparameter optimization. In total, 12 surrogate models (2 algorithms × 3 constituents × 2 stations) are compared using R² and RMSE as performance metrics. The results show that RF outperforms MLR for all the surrogate models when at least three predictors are included. The MLR models give R² values of 0.67 and 0.89 for NO₃⁻-N, 0.39 and 0.40 for NH₄⁺-N, and 0.34 and 0.54 for PO₄³⁻-P at the Pfelling and Jochenstein stations, respectively. RF models produce accurate predictions with low errors for all the targets: NO₃⁻-N (R² = 0.99 and 0.99), NH₄⁺-N (R² = 0.98 and 0.99), and PO₄³⁻-P (R² = 0.96 and 0.96). The percentage improvement in RMSE for RF compared to MLR in predicting nutrients ranges from 73 to 92%.
This work demonstrates the usefulness of surrogate models using the RF algorithm when reproducing nutrient dynamics and serving as soft sensors for monitoring nutrient concentrations.
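
The workflow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the data are synthetic (not the Danube measurements), the column names, the random train/test split, the RF settings, and the percentage-improvement formula are all assumptions made for illustration, and outlier removal, cross-validation, feature selection, and hyperparameter optimization are omitted. It shows linear interpolation of gaps, sine/cosine encoding of the time of day as cyclical features, and a comparison of MLR and RF using R² and RMSE.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic 15-min series standing in for sensor data (assumption, not real data).
idx = pd.date_range("2020-01-01", periods=60 * 96, freq="15min")
n = len(idx)
hour0 = idx.hour.to_numpy() + idx.minute.to_numpy() / 60.0
df = pd.DataFrame(
    {
        "Temp": 10 + 5 * np.sin(2 * np.pi * hour0 / 24) + rng.normal(0, 0.5, n),
        "DO": 9 + rng.normal(0, 1, n),
    },
    index=idx,
)
# Hypothetical nonlinear target, used only to give the models something to learn.
df["NO3_N"] = (
    2 + 0.1 * (df["Temp"] - 10) ** 2 + 0.3 * np.sin(df["DO"]) + rng.normal(0, 0.1, n)
)

# Pre-processing: simulate sensor gaps, then fill them by linear interpolation.
gaps = df.sample(frac=0.02, random_state=0).index
df.loc[gaps, "Temp"] = np.nan
df = df.interpolate(method="linear").dropna()  # dropna handles a leading gap

# Cyclical features: encode time of day so that 23:45 and 00:00 are neighbors.
hour = (df.index.hour + df.index.minute / 60.0).to_numpy()
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)

features = ["Temp", "DO", "hour_sin", "hour_cos"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["NO3_N"], test_size=0.2, random_state=0
)

models = {
    "MLR": LinearRegression(),
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "R2": r2_score(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

rmse_mlr = results["MLR"]["RMSE"]
rmse_rf = results["RF"]["RMSE"]
# Assumed definition: relative RMSE reduction of RF with respect to MLR.
improvement = 100 * (rmse_mlr - rmse_rf) / rmse_mlr

print(results)
print(f"RMSE improvement of RF over MLR: {improvement:.1f}%")
```

Because the synthetic target depends nonlinearly on temperature, RF is expected to fit it better than MLR here. The numbers printed by this sketch say nothing about the scores reported in the abstract.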