Evaluation of gradient boosting and random forest methods to model subdaily variability of the atmosphere–forest CO2 exchange

Matti Kämäräinen, Anna Lintunen, Markku Kulmala, Juha-Pekka Tuovinen, Ivan Mammarella, Juha Aalto, Henriikka Vekuri, Annalea Lohila
DOI: https://doi.org/10.5194/bg-2022-108
2022-01-01
Abstract: Accurate estimates of the net ecosystem CO2 exchange (NEE) would improve the understanding of natural carbon sources and sinks and their role in regulating global atmospheric carbon. In this work, we use and compare the random forest (RF) and gradient boosting (GB) machine learning (ML) methods for predicting the year-round 6-hourly NEE over 1996–2018 in a pine-dominated boreal forest in southern Finland, and we analyze the predictability of the NEE. Additionally, the predictions were aggregated to weekly NEE values to obtain information about the longer-term behavior of the methods. Meteorological variables from the ERA5 reanalysis were used as predictors. A spatial and temporal neighborhood (predictor lagging) was used to give the models more data to learn from, which improved the accuracy compared to using only the nearest grid cell and time step. Both ML methods can explain the temporal variability of the NEE at the observational site of this study with the meteorological predictors, but the GB method was more accurate. GB was more effective in separating the important predictors from the unimportant ones and showed no signs of overfitting despite many redundant variables. The accuracy of GB (RF), measured here mainly with the cross-validated Pearson correlation coefficient between the model result and the observed NEE, was high (good), reaching a best-estimate value of 0.96 (0.94) and a root mean square error of 1.18 µmol m⁻² s⁻¹ (1.35 µmol m⁻² s⁻¹). We recommend using GB instead of RF for modeling ecosystem CO2 fluxes because of its better performance.
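As a rough illustration of two ideas mentioned in the abstract, the sketch below builds temporally lagged predictors from a single series and computes the two accuracy measures named there (Pearson correlation and root mean square error). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `add_lags`, `pearson_r`, and `rmse`, the choice of lags, and the temperature values are all hypothetical, and the model fitting itself (RF or GB) and the cross-validation splits are left out.

```python
import math


def add_lags(series, lags):
    """Build one feature row per time step from a single predictor series.

    Each row holds the value at step t followed by the values at t - lag
    for every requested lag. The first max(lags) steps are dropped because
    their history is incomplete.
    """
    start = max(lags)
    return [[series[t]] + [series[t - lag] for lag in lags]
            for t in range(start, len(series))]


def pearson_r(pred, obs):
    """Pearson correlation coefficient between predictions and observations."""
    n = len(obs)
    mean_p, mean_o = sum(pred) / n, sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)


def rmse(pred, obs):
    """Root mean square error, in the same units as the inputs."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


# Hypothetical 6-hourly predictor values, made up for illustration only.
temperature = [1.0, 3.0, 6.0, 4.0, 2.0, 5.0]

# Lags of 1 and 2 steps correspond to 6 h and 12 h back in a 6-hourly series.
rows = add_lags(temperature, lags=[1, 2])
```

In a full workflow, rows like these (extended with neighboring grid cells for the spatial part) would be passed to an RF or GB regressor, and `pearson_r` and `rmse` would be evaluated on held-out folds rather than on the training data.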