Abstract: Air pollution events can be categorized as extreme or non-extreme on the basis of their magnitude of severity. High-risk extreme air pollution events will exert a disastrous effect on the environment. Therefore, public health and policy-making authorities must be able to determine the characteristics of these events. This study proposes a probabilistic machine learning technique for predicting the classification of extreme and non-extreme events on the basis of data features to address the above issue. The use of the naïve Bayes model in the prediction of air pollution classes is proposed to leverage its simplicity as well as high accuracy and efficiency. A case study was conducted on the air pollution index data of Klang, Malaysia, for the period of January 01, 1997, to August 31, 2020. The trained naïve Bayes model achieves high accuracy, sensitivity, and specificity on the training and test datasets. Therefore, the naïve Bayes model can be easily applied in air pollution analysis while providing a promising solution for the accurate and efficient prediction of extreme or non-extreme air pollution events. The findings of this study provide reliable information to public authorities for monitoring and managing sustainable air quality over time.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
The paper aims to address the issue of how to classify and predict the severity of air pollution events. Specifically, the research goal is to predict whether an air pollution event is an extreme event (high risk) or a non-extreme event (low risk) based on data features. To achieve this goal, the authors propose a probabilistic machine learning method based on the Naive Bayes model, leveraging its simplicity, high accuracy, and efficiency to solve the aforementioned problem.
### Research Background
Air pollution events can be categorized into extreme and non-extreme types based on their severity. High-risk extreme air pollution events have catastrophic impacts on the environment, so it is crucial for public health and policy-making agencies to identify the characteristics of these events. Many studies have applied statistical techniques to air pollution data, including regression analysis, spatiotemporal techniques, extreme value theory, multivariate techniques, neural networks, and deep learning methods. However, these methods have limitations in interpretability and computational cost. Researchers have therefore begun to consider probabilistic Bayesian frameworks for classification and prediction, which provide more information and better model interpretability.
### Research Methods
1. **Data Source**: The study used Air Pollution Index (API) data from Klang, Malaysia, from January 1, 1997, to August 31, 2020.
2. **Feature Engineering**: Three main features were defined: duration, intensity level, and severity. Duration refers to the continuous period during which the API value is greater than 100; intensity level refers to the maximum API value within a certain duration; severity refers to the cumulative value of API greater than 100 within a certain duration.
3. **Model Selection**: The Naive Bayes model was chosen as the classifier due to its simplicity, efficiency, and accuracy.
4. **Model Training and Evaluation**: The dataset was split into a training set (70%) and a test set (30%), using cross-validation and hyperparameter tuning to optimize model performance. The model's accuracy, sensitivity, specificity, and other metrics were evaluated using a confusion matrix.
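
The three features defined in the Feature Engineering step can be illustrated with a minimal Python sketch. It is not the authors' code: the names `Event` and `extract_events` and the sample series are hypothetical, and severity is taken here to be the sum of the API values within an event, which is only one possible reading of "cumulative value" (the sum of the excess over 100 would be another).

```python
from typing import List, NamedTuple, Sequence


class Event(NamedTuple):
    duration: int     # number of consecutive observations with API > threshold
    intensity: float  # maximum API value within the event
    severity: float   # sum of API values within the event (assumed reading)


def extract_events(api: Sequence[float], threshold: float = 100.0) -> List[Event]:
    """Split an API series into runs above the threshold and summarize each run."""
    events: List[Event] = []
    run: List[float] = []
    for value in api:
        if value > threshold:
            run.append(value)
        elif run:
            # The run just ended: record its features and start over.
            events.append(Event(len(run), max(run), sum(run)))
            run = []
    if run:
        # The series ended while still above the threshold.
        events.append(Event(len(run), max(run), sum(run)))
    return events


# Hypothetical daily API readings, invented for illustration only.
series = [80, 105, 130, 90, 101, 70, 150, 160, 170]
events = extract_events(series)
for event in events:
    print(event)
```

Each resulting event is one row of the feature table that a classifier would then be trained on.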
### Results and Discussion
- **Data Distribution**: The distribution of duration and intensity showed skewness, so non-parametric kernel density estimation was used as an alternative to the Gaussian distribution model.
- **Model Performance**: The trained Naive Bayes model achieved an accuracy of 0.9751, sensitivity of 0.9661, and specificity of 0.9789 on the training set. It also performed well on the test set, indicating good generalization ability.
- **Comparative Analysis**: Compared to other classification methods (such as linear discriminant analysis, quadratic discriminant analysis, and multiple discriminant analysis), the Naive Bayes model achieved higher accuracy and better overall classification performance.
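
As a rough illustration of the ideas above, the sketch below implements a naive Bayes classifier whose per-feature class-conditional densities are Gaussian kernel density estimates rather than fitted Gaussians, and computes accuracy, sensitivity, and specificity from confusion-matrix counts. It is a simplified sketch under stated assumptions, not the paper's implementation: `KDENaiveBayes` and `binary_metrics` are hypothetical names, the toy data is invented, and a single fixed bandwidth is shared by all features, whereas a real analysis would tune it per feature.

```python
import math


def kde_logpdf(x, samples, bandwidth):
    """Log of a Gaussian kernel density estimate evaluated at x."""
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    density = total / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return math.log(max(density, 1e-300))  # floor avoids log(0)


class KDENaiveBayes:
    """Naive Bayes with one kernel density estimate per class and feature."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth  # assumed fixed; not tuned in this sketch

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.priors_ = {c: y.count(c) / len(y) for c in self.classes_}
        # For each class, store the training values of every feature column.
        self.samples_ = {
            c: [list(col) for col in zip(*(r for r, lab in zip(X, y) if lab == c))]
            for c in self.classes_
        }
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            # Log posterior up to a constant: log prior + sum of log likelihoods.
            scores = {
                c: math.log(self.priors_[c]) + sum(
                    kde_logpdf(v, col, self.bandwidth)
                    for v, col in zip(row, self.samples_[c])
                )
                for c in self.classes_
            }
            predictions.append(max(scores, key=scores.get))
        return predictions


def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity); assumes both classes occur."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return (tp + tn) / len(pairs), tp / (tp + fn), tn / (tn + fp)


# Invented toy events: (duration, intensity); label 1 = extreme, 0 = non-extreme.
X = [(1, 105), (2, 110), (1, 102), (2, 115), (6, 180), (8, 210), (7, 195), (9, 230)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = KDENaiveBayes(bandwidth=5.0).fit(X, y)
accuracy, sensitivity, specificity = binary_metrics(y, model.predict(X))
print(accuracy, sensitivity, specificity)
```

Sensitivity here is the fraction of extreme events correctly flagged, and specificity is the fraction of non-extreme events correctly left unflagged, which is why the two are reported separately from overall accuracy.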
### Conclusion
The study proposes a probabilistic classification method based on the Naive Bayes model to predict the severity of air pollution events. This method is not only accurate and efficient but also provides reliable decision support information, aiding public institutions in monitoring and managing sustainable air quality.