Feature Selection and Regression Models for Multisource Data-Based Soil Salinity Prediction: A Case Study of Minqin Oasis in Arid China

Sheshu Zhang,Jun Zhao,Jianxia Yang,Jinfeng Xie,Ziyun Sun
DOI: https://doi.org/10.3390/land13060877
IF: 3.905
2024-06-19
Land
Abstract: (1) Monitoring salinized soil in saline–alkali land is essential, requiring regional-scale soil salinity inversion. This study aims to identify sensitive variables for predicting electrical conductivity (EC) in soil, focusing on effective feature selection methods. (2) The study systematically selects a feature subset from Sentinel-1 C SAR, Sentinel-2 MSI, and SRTM DEM data. Various feature selection methods (correlation analysis, LASSO, RFE, and GRA) are employed on 79 variables. Regression models using random forest regression (RF) and partial least squares regression (PLSR) algorithms are constructed and compared. (3) The results highlight the effectiveness of the RFE algorithm in reducing model complexity. The model incorporates significant environmental factors like soil moisture, topography, and soil texture, which play an important role in modeling. Combining the method with RF improved soil salinity prediction (R² = 0.71, RMSE = 1.47, RPD = 1.84). Overall, salinization in Minqin oasis soils was evident, especially in the unutilized land at the edge of the oasis. (4) Integrating data from different sources to construct characterization variables overcomes the limitations of a single data source. Variable selection is an effective means to address the redundancy of variable information, providing insights into feature engineering and variable selection for soil salinity estimation in arid and semi-arid regions.
environmental studies
What problem does this paper attempt to address?
The paper addresses the prediction of soil salinity in arid areas from multi-source data (Sentinel-1 C SAR, Sentinel-2 MSI, and SRTM DEM). Specifically, the research aims to:

1. **Identify sensitive variables**: Systematically select feature subsets from the multi-source data to predict soil electrical conductivity (EC), with an emphasis on effective feature selection methods.
2. **Construct regression models**: Build and compare regression models based on random forest regression (RF) and partial least squares regression (PLSR) to improve the accuracy of soil salinity prediction.
3. **Evaluate model performance**: Improve the prediction model by reducing its complexity and incorporating important environmental factors (such as soil moisture, terrain, and soil texture), and assess it with performance indicators such as \(R^2\), RMSE, and RPD.

The ultimate goal is to provide guidance for soil salinity prediction in arid areas, especially regions like Minqin Oasis. Integrating data from different sources overcomes the limitations of a single data source, and feature selection reduces the redundancy of variable information, which together improve the predictive ability and interpretability of the model.
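The three aims above can be illustrated with a minimal sketch, assuming scikit-learn and NumPy are available. It is not the authors' pipeline: the data are synthetic stand-ins for the 79 candidate variables, and the sample size, the number of retained features (10), the elimination step, and the forest sizes are arbitrary assumptions chosen only for illustration. RPD is computed here as the standard deviation of the observed values divided by RMSE, which is a common definition; the summary does not state which variant the paper uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def rpd(y_true, y_pred):
    """Ratio of performance to deviation: SD of observations / RMSE (assumed definition)."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return float(np.std(y_true, ddof=1) / rmse)


# Synthetic stand-in for the 79 candidate variables (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n_samples, n_features = 200, 79
X = rng.normal(size=(n_samples, n_features))
# Hypothetical "EC" target that depends on only three of the variables plus noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Recursive feature elimination: repeatedly fit a random forest and drop the
# least important features (5 per round) until 10 remain.
selector = RFE(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=10,
    step=5,
)
selector.fit(X_train, y_train)
selected = np.flatnonzero(selector.support_)

# Refit a random forest on the reduced feature subset and score it on held-out data.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train[:, selected], y_train)
y_pred = model.predict(X_test[:, selected])

r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
rpd_value = rpd(y_test, y_pred)

print("selected feature indices:", selected.tolist())
print(f"R2 = {r2:.2f}, RMSE = {rmse:.2f}, RPD = {rpd_value:.2f}")
```

Swapping the final estimator for `sklearn.cross_decomposition.PLSRegression` would give a PLSR counterpart for comparison. Any scores printed for this synthetic data say nothing about the values reported in the paper.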