Abstract: The air quality index (AQI) is an index for reporting air quality. It measures the impact of air pollution on a person's health over a short period of time, and its purpose is to educate the public about the negative health effects of local air pollution. Air pollution in Indian cities has increased significantly. There are several ways to construct a mathematical formula for the AQI, and numerous studies have found a link between exposure to air pollution and adverse health impacts in the population. Data mining techniques are among the most promising approaches for forecasting and analyzing the AQI. The aim of this paper is to find the most effective method for AQI prediction to assist in climate control, a method that can then be improved upon further. To that end, the work involves intensive research and the addition of novel techniques such as the synthetic minority oversampling technique (SMOTE) to obtain the best possible solution to the air quality problem. Another important goal is to present the exact metrics involved in our work in a way that is educational and insightful, providing proper comparisons and assisting future researchers. In the proposed work, three distinct methods, namely support vector regression (SVR), random forest regression (RFR), and CatBoost regression (CR), are used to determine the AQI of New Delhi, Bangalore, Kolkata, and Hyderabad. On the imbalanced datasets, random forest regression gives the lowest root mean square error (RMSE) in Bangalore (0.5674), Kolkata (0.1403), and Hyderabad (0.3826), as well as higher accuracy than SVR and CatBoost regression for Kolkata (90.9700%) and Hyderabad (78.3672%), while CatBoost regression gives the lowest RMSE in New Delhi (0.2792) and the highest accuracy for New Delhi (79.8622%) and Bangalore (68.6860%).
On the datasets balanced with the SMOTE algorithm, random forest regression gives the lowest RMSE in Kolkata (0.0988) and Hyderabad (0.0628) and higher accuracy than SVR and CatBoost regression for Kolkata (93.7438%) and Hyderabad (97.6080%), whereas CatBoost regression gives the highest accuracy for New Delhi (85.0847%) and Bangalore (90.3071%). These results show that the datasets to which SMOTE was applied produced higher accuracy. The novelty of this paper lies in selecting the best regression models through a thorough analysis of their accuracies and, unlike most related papers, in balancing the datasets with SMOTE. In addition, all implementations are documented with graphs and metrics that clearly show the contrast in results and help identify what caused the improvement in accuracy.
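The model comparison in the abstract rests on RMSE, so a minimal sketch of that metric may help make the comparison concrete. The snippet below is an illustration only, not the authors' implementation: the function names `rmse` and `best_model` are hypothetical, and the target and prediction values are made up for the example rather than taken from the paper's datasets.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def best_model(actual, predictions_by_model):
    """Return (name, rmse) for the model with the lowest RMSE."""
    scores = {
        name: rmse(actual, preds)
        for name, preds in predictions_by_model.items()
    }
    winner = min(scores, key=scores.get)
    return winner, scores[winner]


# Made-up values, used only to exercise the functions (assumption).
actual = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "SVR": [1.5, 2.5, 2.5, 4.5],
    "RFR": [1.0, 2.0, 3.0, 4.4],
    "CatBoost": [1.1, 2.1, 2.9, 4.1],
}

name, score = best_model(actual, predictions)
print(f"lowest RMSE: {name} ({score:.4f})")
```

In a per-city comparison like the one reported above, `best_model` would be called once per city with that city's held-out targets and each regressor's predictions.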
