Abstract: The increasing air pollution poses an urgent global concern with far-reaching consequences, such as premature mortality and reduced crop yield, which significantly impact various aspects of our daily lives. Accurate and timely analysis of air pollution is crucial for understanding its underlying mechanisms and implementing necessary precautions to mitigate potential socio-economic losses. Traditional analytical methodologies, such as atmospheric modeling, heavily rely on domain expertise and often make simplified assumptions that may not be applicable to complex air pollution problems. In contrast, Machine Learning (ML) models are able to capture the intrinsic physical and chemical rules by automatically learning from a large amount of historical observational data, showing great promise in various air quality analytical tasks. In this article, we present a comprehensive survey of ML-based air quality analytics, following a roadmap spanning from data acquisition to pre-processing, and encompassing various analytical tasks such as pollution pattern mining, air quality inference, and forecasting. Moreover, we offer a systematic categorization and summary of existing methodologies and applications, while also providing a list of publicly available air quality datasets to ease the research in this direction. Finally, we identify several promising future research directions. This survey can serve as a valuable resource for professionals seeking suitable solutions for their specific challenges and advancing their research at the cutting edge.

What problem does this paper attempt to address?

This paper addresses the challenges of urban air pollution analysis, in particular how machine-learning techniques can improve the accuracy, efficiency and reliability of that analysis. Specifically, the paper focuses on the following aspects:

1. **Integration of heterogeneous data sources**: Urban air pollution analysis draws on multiple data sources, such as local meteorology, traffic flow, pollution emissions and human activities. These sources are highly heterogeneous, with different spatial resolutions, modalities, structures and densities, which makes them difficult to integrate. Developing machine-learning techniques that can effectively fuse such heterogeneous data is therefore an important challenge.
2. **Insufficient data coverage**: Machine-learning models usually require a large amount of observational data to perform well. However, for economic reasons, the number of monitoring sensors deployed in cities is limited, so the data are sparse. For example, in Beijing, only 0.2% of the data is observed. Sparse and non-uniformly distributed air quality data may deviate from the true distribution of the entire data set, introducing bias into subsequent analysis tasks. Developing data-efficient machine-learning techniques is therefore also a significant challenge.
3. **Complex spatio-temporal dependencies among pollutants**: Air pollution exhibits complex spatio-temporal dependencies because the spread and chemical reactions of different pollutants vary in both time and space. For example, strong winds blowing from one location to another can transport pollutants, strengthening the correlation between the two locations. Conversely, a change in wind direction weakens this correlation. Traditional machine-learning models (such as support vector machines and random forests) rely on feature engineering and cannot handle such complex, nonlinear, dynamic dependencies. There is therefore an urgent need for more expressive machine-learning models that can effectively capture the spatio-temporal dependencies among pollutants.

By addressing these problems, the paper aims to provide a comprehensive review of machine-learning techniques covering the entire process from data acquisition to pre-processing and on to the various analysis tasks (such as pollution pattern mining, air quality inference and prediction). This helps researchers understand current research progress and provides insights and directions for future research.
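To make the sparse-coverage problem concrete, the sketch below estimates a pollutant reading at an unmonitored location from a few nearby stations using inverse-distance weighting (IDW). IDW is a classical interpolation baseline chosen here purely for illustration; it is not a method attributed to the survey, and the function name `idw_estimate`, the station coordinates and the readings are all hypothetical.

```python
import math


def idw_estimate(stations, target, power=2.0):
    """Estimate a pollutant value at `target` from sparse station readings.

    stations: list of ((x_km, y_km), value) pairs (hypothetical layout).
    target:   (x_km, y_km) of the unmonitored location.
    power:    exponent controlling how fast a station's influence decays.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in stations:
        dist = math.hypot(x - target[0], y - target[1])
        if dist == 0.0:
            # The target coincides with a station: use its reading directly.
            return value
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("need at least one station")
    return weighted_sum / weight_total


# Two made-up stations 10 km apart; the target sits exactly halfway between them.
stations = [((0.0, 0.0), 80.0), ((10.0, 0.0), 40.0)]
estimate = idw_estimate(stations, (5.0, 0.0))
print(estimate)
```

Because the weights depend only on distance, this baseline ignores wind direction and other time-varying factors, which is the kind of dynamic dependency described in the third item above.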

Machine Learning for Urban Air Quality Analytics: A Survey

Application of Machine Learning in Atmospheric Pollution Research: A State-of-art Review

The Application of Machine Learning to Air Pollution Research: A Bibliometric Analysis

Supervised Machine Learning Approaches for Predicting Key Pollutants and for the Sustainable Enhancement of Urban Air Quality: A Systematic Review

Air quality and urban sustainable development: the application of machine learning tools

Forecasting Smog-Related Health Hazard Based on Social Media and Physical Sensor

Data-Driven Machine Learning in Environmental Pollution: Gains and Problems

An overview of air quality analysis by big data techniques: Monitoring, forecasting, and traceability

A review of machine learning for modeling air quality: Overlooked but important issues

Machine Learning Models for Predicting Air Pollution

Data-Driven Air Quality Characterization for Urban Environments: A Case Study

Machine Learning Techniques for Spatio-Temporal Air Pollution Prediction to Drive Sustainable Urban Development in the Era of Energy and Data Transformation

Air Quality Forecasting Using Machine Learning: A Global perspective with Relevance to Low-Resource Settings

Air Pollution Monitoring and Prediction using Machine Learning Algorithms

Air pollution prediction with machine learning: a case study of Indian cities

The Development and Application of Machine Learning in Atmospheric Environment Studies

Enhancing Air Quality Prediction with Social Media and Natural Language Processing

Data-driven Air Quality Characterisation for Urban Environments: a Case Study

Machine learning analysis of socioeconomic drivers in urban ozone pollution in Chinese cities

Extracting Regional and Temporal Features to Improve Machine Learning for Hourly Air Pollutants in Urban India