Evaluation of the prediction effectiveness for geochemical mapping using machine learning methods: A case study from northern Guangdong Province in China
Songjian Lv, Ying Zhu, Li Cheng, Jingru Zhang, Wenjie Shen, Xingyuan Li
DOI: https://doi.org/10.1016/j.scitotenv.2024.172223
IF: 9.8
2024-04-15
The Science of The Total Environment
Abstract: This study compares seven machine learning models to investigate whether they improve the accuracy of geochemical mapping relative to ordinary kriging (OK). Arsenic is widespread in soil because of human activities and soil parent material, and it is highly toxic. Predicting the spatial distribution of elements in soil has become a research hotspot. Lianzhou City in northern Guangdong Province, China, was chosen as the study area, where a total of 2908 surface soil samples were collected from 0 to 20 cm depth. The seven machine learning models were Random Forest (RF), Support Vector Machine (SVM), Ridge Regression (Ridge), Gradient Boosting Decision Tree (GBDT), Artificial Neural Network (ANN), K-Nearest Neighbors (KNN), and Gaussian Process Regression (GPR). The study explores the advantages and disadvantages of machine learning and traditional geostatistical models in predicting the spatial distribution of heavy metal elements, and it analyzes the factors that affect the accuracy of element prediction. The two best-performing original models, RF (R² = 0.445) and GBDT (R² = 0.414), did not outperform OK (R² = 0.459) in prediction accuracy. Ridge and GPR, the worst-performing methods, had R² values of only 0.201 and 0.248, respectively. To improve the models' prediction accuracy, a spatial regionalized (SR) covariate index was added. The improvement varied among methods: RF and GBDT increased their R² values from about 0.4 to 0.78 after the enhancement, whereas the GPR model showed the smallest improvement, with its R² value reaching only 0.25 in the improved method. The study concludes that choosing the right machine learning model and accounting for factors that influence prediction accuracy, such as regional variations, the number of sampling points, and their distribution, are crucial for ensuring accurate predictions. These findings provide useful insights for future research in this area.
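The abstract ranks every method by its R² value. As a minimal sketch of how that metric is computed and used to compare two predictors, the following uses the standard definition R² = 1 - SS_res / SS_tot. The function name `r2_score`, the sample values, and the labels `model_a` and `model_b` are illustrative assumptions; none of them come from the paper or its data.

```python
def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left after prediction
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up held-out concentrations (not data from the study)
observed = [10.0, 20.0, 30.0, 40.0]
# Two hypothetical sets of predictions for the same held-out points
model_a = [12.0, 18.0, 33.0, 37.0]
model_b = [25.0, 25.0, 25.0, 25.0]  # always predicts the observed mean

scores = {
    "model_a": r2_score(observed, model_a),
    "model_b": r2_score(observed, model_b),
}
best = max(scores, key=scores.get)
print(scores, best)
```

A predictor that always returns the observed mean scores R² = 0, so the values reported above (for example 0.459 for OK) can be read as the share of variance explained beyond that baseline.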
environmental sciences