Using Multivariate Linear Regression for Biochemical Oxygen Demand Prediction in Waste Water

Isaiah K. Mutai, Kristof Van Laerhoven, Nancy W. Karuri, Robert K. Tewo
DOI: https://doi.org/10.3934/aci.2024008
2022-09-08
Abstract: There exist opportunities for Multivariate Linear Regression (MLR) in the prediction of Biochemical Oxygen Demand (BOD) in waste water, using diverse water quality parameters as input variables. The goal of this work is to examine the capability of MLR to predict BOD in waste water from four input variables: Dissolved Oxygen (DO), Nitrogen, Fecal Coliform and Total Coliform. Of the seven parameters examined, these four have the strongest correlation with BOD. The model was trained with both 80% and 90% of the data as the training set, with the remaining 20% and 10% respectively as the test set. MLR performance was evaluated through the correlation coefficient (r), Root Mean Square Error (RMSE) and the percentage accuracy in prediction of BOD. With these four input variables, the performance indices are RMSE = 6.77 mg/L, r = 0.60 and accuracy of 70.3% for a training set of 80% of the dataset, and RMSE = 6.74 mg/L, r = 0.60 and accuracy of 87.5% for a training set of 90%. Increasing the training set above 80% of the dataset improved only the accuracy of the model and did not have a significant impact on its prediction capacity. The results showed that an MLR model can be successfully employed to estimate BOD in waste water using appropriately selected input parameters.
Other Quantitative Biology, Machine Learning, Applications
What problem does this paper attempt to address?
The paper asks whether a Multivariate Linear Regression (MLR) model can predict Biochemical Oxygen Demand (BOD) in wastewater, reducing dependence on the traditional BOD measurement and making estimates available sooner. The authors build an MLR model from four water quality parameters that correlate with BOD more strongly than the other parameters examined: Dissolved Oxygen (DO), Nitrogen, Fecal Coliform, and Total Coliform. The study evaluates the model's performance in BOD prediction and examines how the training set / test set ratio affects that performance.

### Main problem summary:

1. **Necessity of BOD prediction**: The traditional BOD measurement is time-consuming (it takes 5 days), which can delay the response to pollution. Rapid and accurate BOD prediction therefore supports timely pollution prevention and control.
2. **Selection of appropriate input parameters**: The correlation between several water quality parameters and BOD is analyzed, and DO, Nitrogen, Fecal Coliform, and Total Coliform are selected as the MLR input variables because they show the strongest association with BOD.
3. **Model performance evaluation**: The MLR model is evaluated with two training set / test set ratios (80%/20% and 90%/10%), using Root Mean Square Error (RMSE), the correlation coefficient (r), and the percentage accuracy of prediction.

### Research conclusions:

- **Strongly correlated parameters**: DO, Fecal Coliform, Total Coliform, and Nitrogen have the strongest association with BOD among the parameters examined.
- **Optimal training set ratio**: The 80%/20% training set / test set ratio is identified as the best choice. Moving to 90%/10% raises the reported accuracy (from 70.3% to 87.5%), but RMSE and r are essentially unchanged (6.77 mg/L versus 6.74 mg/L, with r = 0.60 in both cases), so the prediction ability does not improve significantly.
- **Performance of the MLR model**: With appropriately selected input parameters, the MLR model can be used successfully to predict wastewater BOD, and it is reported to perform better than some other models (such as an artificial neural network model).

### Formula representation:

The general form of the MLR model is:

\[ y = C_1 x_1 + C_2 x_2 + C_3 x_3 + C_4 x_4 + \beta_0 \]

where:

- \( y \) is the dependent variable (BOD)
- \( \beta_0 \) is the intercept
- \( C_1, C_2, C_3, C_4 \) are the coefficients of the respective independent variables
- \( x_1, x_2, x_3, x_4 \) are Dissolved Oxygen (DO), Nitrogen, Fecal Coliform, and Total Coliform respectively.
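To make the MLR form above and the split-based evaluation concrete, here is a minimal sketch in Python with NumPy. It is not the authors' code: the data are synthetic stand-ins (the paper's dataset is not reproduced here), the coefficient values are arbitrary, and the helper names (`fit_mlr`, `predict`, `rmse`, `pearson_r`, `evaluate_split`) are hypothetical. The paper's percentage accuracy metric is left out because its definition is not given in this summary.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least-squares fit of y = C1*x1 + ... + C4*x4 + beta0."""
    # Append a column of ones so the last fitted parameter is the intercept.
    A = np.column_stack([X, np.ones(len(X))])
    params, *_ = np.linalg.lstsq(A, y, rcond=None)
    return params[:-1], params[-1]


def predict(X, coefs, intercept):
    """Apply the fitted linear model to new rows of X."""
    return X @ coefs + intercept


def rmse(y_true, y_pred):
    """Root Mean Square Error between observed and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def evaluate_split(X, y, train_fraction, seed=0):
    """Shuffle, split into train/test, fit on train, score on test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    train, test = order[:n_train], order[n_train:]
    coefs, intercept = fit_mlr(X[train], y[train])
    y_hat = predict(X[test], coefs, intercept)
    return {
        "rmse": rmse(y[test], y_hat),
        "r": pearson_r(y[test], y_hat),
        "n_test": len(test),
    }


# ASSUMPTION: synthetic data standing in for (DO, Nitrogen, Fecal Coliform,
# Total Coliform) and BOD. The coefficients and noise level are arbitrary.
rng = np.random.default_rng(42)
n_samples = 200
X = rng.normal(size=(n_samples, 4))
true_coefs = np.array([-1.5, 0.8, 0.5, 0.3])
true_intercept = 10.0
y = X @ true_coefs + true_intercept + rng.normal(scale=1.0, size=n_samples)

# Compare the two split ratios discussed in the summary.
results = {frac: evaluate_split(X, y, frac) for frac in (0.8, 0.9)}
for frac, metrics in results.items():
    print(f"train={frac:.0%}  n_test={metrics['n_test']}  "
          f"RMSE={metrics['rmse']:.3f}  r={metrics['r']:.3f}")
```

Because the data are synthetic, the printed values say nothing about the paper's results. The sketch only shows how RMSE and r on held-out data could be compared across the two split ratios.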