WaveCatBoost for Probabilistic Forecasting of Regional Air Quality Data

Jintu Borah, Tanujit Chakraborty, Md. Shahrul Md. Nadzir, Mylene G. Cayetano, Shubhankar Majumdar
2024-04-08
Abstract: Accurate and reliable air quality forecasting is essential for protecting public health, sustainable development, pollution control, and enhanced urban planning. This letter presents a novel WaveCatBoost architecture designed to forecast the real-time concentrations of air pollutants by combining the maximal overlapping discrete wavelet transform (MODWT) with the CatBoost model. This hybrid approach efficiently transforms time series into high-frequency and low-frequency components, thereby extracting signal from noise and improving prediction accuracy and robustness. Evaluation of two distinct regional datasets, from the Central Air Pollution Control Board (CPCB) sensor network and a low-cost air quality sensor system (LAQS), underscores the superior performance of our proposed methodology in real-time forecasting compared to the state-of-the-art statistical and deep learning architectures. Moreover, we employ a conformal prediction strategy to provide probabilistic bands with our forecasts.
Machine Learning
What problem does this paper attempt to address?
The paper sets out to **improve the accuracy of real-time forecasts of regional air quality**. It proposes WaveCatBoost, an architecture that forecasts real-time air pollutant concentrations by combining the Maximal Overlap Discrete Wavelet Transform (MODWT) with the CatBoost model. The decomposition is intended to separate signal from noise and so improve the accuracy and robustness of the forecasts. The paper also applies a conformal prediction strategy to attach prediction intervals to the point forecasts, which quantifies their uncertainty.

### Main contributions of the paper:

1. **Innovative WaveCatBoost architecture**: It combines the wavelet transform with the CatBoost model and can handle non-stationarity and long-term dependencies in time-series data.
2. **Real-time prediction ability**: Evaluated on two regional datasets, the method outperforms existing statistical and deep-learning architectures in real-time forecasting.
3. **Probabilistic prediction**: Conformal prediction supplies prediction intervals around the forecasts, making the results more reliable.
4. **Practical application value**: The method can support public health protection, sustainable development, pollution control, and urban planning.

### Background of the paper:

- **Air pollution problem**: Air pollution is a major global problem with serious effects on public health and the environment. The World Health Organization has issued guidelines for six major air pollutants, and national ambient air quality standards have also been established.
- **Limitations of existing methods**: Traditional prediction models struggle with non-linear relationships and non-stationary changes, and they adapt poorly to diverse monitoring environments and real-time prediction requirements.
- **Application of machine learning**: Machine-learning algorithms have recently made significant progress in improving the accuracy of air quality prediction, but they still face challenges such as target leakage and long training times.

### Method overview:

- **Data collection and pre-processing**: Real-time air quality data are collected from the Central Pollution Control Board (CPCB) sensor network and the Low-Cost Air Quality Sensor System (LAQS), then pre-processed, including missing-value handling and hourly averaging.
- **WaveCatBoost model**: MODWT decomposes the air quality series into high-frequency and low-frequency components, and a CatBoost model is fitted to each component. The component forecasts are then combined through the inverse MODWT (IMODWT) to produce the final forecast.
- **Probabilistic prediction**: Conformal prediction generates prediction intervals around the point forecasts to quantify their uncertainty.

### Experimental results:

- **Performance evaluation**: The model was evaluated over four forecast horizons (1 day, 7 days, 14 days, and 31 days). WaveCatBoost performs well across these horizons, especially the longer ones of 14 days and 31 days.
- **Statistical significance**: The Multiple Comparison with the Best (MCB) test confirms that WaveCatBoost performs significantly better than the other benchmark models.
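
The decompose, model, and recombine steps from the method overview can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a Haar filter (the summary does not say which wavelet the paper uses), circular boundary handling, and a least-squares AR(1) model as a stand-in for CatBoost so that the example needs only the standard library. The function names `haar_modwt`, `haar_reconstruct`, `ar1_forecast`, and `wave_forecast` are hypothetical. For the Haar filter the MODWT coefficients at each level add back to the previous level, so reconstruction here is a plain sum; a general IMODWT for longer filters is more involved.

```python
def haar_modwt(x, levels):
    """Haar MODWT with circular boundaries.

    Returns (details, smooth): details[j] holds the level-(j+1) wavelet
    coefficients (high-frequency part), smooth holds the final scaling
    coefficients (low-frequency part). Every component has len(x) values.
    """
    n = len(x)
    smooth = list(x)
    details = []
    for j in range(levels):
        shift = 2 ** j  # the two filter taps sit 2**j samples apart
        prev = smooth
        # (t - shift) % n wraps around the start of the series.
        smooth = [(prev[t] + prev[(t - shift) % n]) / 2 for t in range(n)]
        details.append([(prev[t] - prev[(t - shift) % n]) / 2 for t in range(n)])
    return details, smooth


def haar_reconstruct(details, smooth):
    """Invert haar_modwt: with the Haar filter the components sum to x."""
    return [s + sum(d[t] for d in details) for t, s in enumerate(smooth)]


def ar1_forecast(series, horizon):
    """Stand-in component model (NOT CatBoost): AR(1) through the origin."""
    num = sum(a * b for a, b in zip(series[1:], series[:-1]))
    den = sum(b * b for b in series[:-1])
    phi = num / den if den else 0.0
    forecasts, last = [], series[-1]
    for _ in range(horizon):
        last = phi * last
        forecasts.append(last)
    return forecasts


def wave_forecast(x, levels, horizon, component_model=ar1_forecast):
    """Forecast each wavelet component separately, then add the forecasts."""
    details, smooth = haar_modwt(x, levels)
    parts = [component_model(c, horizon) for c in details + [smooth]]
    return [sum(p[h] for p in parts) for h in range(horizon)]


# Made-up hourly readings, for illustration only.
readings = [42.0, 45.0, 47.0, 44.0, 40.0, 39.0, 43.0, 48.0]
print(wave_forecast(readings, levels=2, horizon=3))
```

To move closer to the described architecture, `component_model` could be replaced by a function that trains a gradient-boosted regressor such as CatBoost on lagged values of each component. Because the circular boundary mixes the end of the series into the first coefficients, a real pipeline would also need to decide how to treat those boundary values.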
### Conclusion:

The WaveCatBoost model proposed in the paper performs well in real-time air quality prediction and captures non-stationarity and long-term dependencies in time-series data. The method improves forecast accuracy and, through conformal prediction intervals, makes the forecasts more reliable. Future research directions include modeling spatial dependence to further improve the model's predictive ability.
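
As an illustration of how prediction intervals can be wrapped around point forecasts, the sketch below uses split conformal prediction with absolute residuals. The summary does not state which conformal variant the paper uses, so this choice is an assumption, and the names `conformal_half_width` and `conformal_band` are hypothetical. The idea is to measure forecast errors on a held-out calibration set and use a high quantile of those errors as the half-width of the band.

```python
import math


def conformal_half_width(calib_actual, calib_pred, alpha=0.1):
    """Half-width of a (1 - alpha) band from held-out calibration errors."""
    scores = sorted(abs(a - p) for a, p in zip(calib_actual, calib_pred))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to support this coverage level.
        return math.inf
    return scores[k - 1]


def conformal_band(forecasts, half_width):
    """Symmetric (lower, upper) interval around each point forecast."""
    return [(f - half_width, f + half_width) for f in forecasts]


# Made-up calibration data, for illustration only.
actual = [41.0, 44.0, 50.0, 47.0, 39.0, 42.0, 46.0, 45.0, 43.0, 48.0]
predicted = [40.0, 46.0, 47.0, 48.0, 41.0, 42.5, 44.0, 45.5, 44.0, 45.0]
width = conformal_half_width(actual, predicted, alpha=0.2)
print(conformal_band([44.0, 46.0], width))
```

The coverage guarantee of split conformal prediction relies on exchangeable data, which time series generally are not, so bands built this way should be read as approximate unless a time-series-aware variant is used.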