Abstract: The increasing air pollution poses an urgent global concern with far-reaching consequences, such as premature mortality and reduced crop yield, which significantly impact various aspects of our daily lives. Accurate and timely analysis of air pollution is crucial for understanding its underlying mechanisms and implementing necessary precautions to mitigate potential socio-economic losses. Traditional analytical methodologies, such as atmospheric modeling, heavily rely on domain expertise and often make simplified assumptions that may not be applicable to complex air pollution problems. In contrast, Machine Learning (ML) models are able to capture the intrinsic physical and chemical rules by automatically learning from a large amount of historical observational data, showing great promise in various air quality analytical tasks. In this article, we present a comprehensive survey of ML-based air quality analytics, following a roadmap spanning from data acquisition to pre-processing, and encompassing various analytical tasks such as pollution pattern mining, air quality inference, and forecasting. Moreover, we offer a systematic categorization and summary of existing methodologies and applications, while also providing a list of publicly available air quality datasets to ease the research in this direction. Finally, we identify several promising future research directions. This survey can serve as a valuable resource for professionals seeking suitable solutions for their specific challenges and advancing their research at the cutting edge.
What problem does this paper attempt to address?
This paper addresses the challenges of urban air pollution analysis, in particular how machine-learning techniques can improve its accuracy, efficiency, and reliability. Specifically, it focuses on the following aspects:
1. **Integration of heterogeneous data sources**: Urban air pollution analysis draws on multiple data sources, such as local meteorology, traffic flow, pollution emissions, and human activities. These sources are highly heterogeneous, differing in spatial resolution, modality, structure, and density, which makes them difficult to integrate. Developing machine-learning techniques that can effectively fuse such heterogeneous data is therefore an important challenge.
2. **Insufficient data coverage**: Machine-learning models usually require a large amount of observational data to perform well. However, for cost reasons, only a limited number of monitoring sensors are deployed in cities, so the data are sparse. For example, in Beijing, only 0.2% of the data is observed. Such sparse and non-uniformly distributed air quality data may deviate from the true distribution of the entire data set, introducing bias into subsequent analysis tasks. Developing data-efficient machine-learning techniques is therefore also a significant challenge.
3. **Complex spatio-temporal dependencies among pollutants**: Air pollution exhibits complex spatio-temporal dependencies because pollutants disperse and react chemically in ways that vary across both time and space. For example, a strong wind blowing from one location to another can transport pollutants, strengthening the correlation between the two locations, whereas a change in wind direction weakens it. Traditional machine-learning models (such as support vector machines and random forests) rely on feature engineering and cannot handle such complex, nonlinear, and dynamic dependencies. There is therefore an urgent need for more sophisticated machine-learning models that can effectively capture the spatio-temporal dependencies among pollutants.
To address these problems, the paper provides a comprehensive review of machine-learning techniques covering the entire pipeline, from data acquisition and pre-processing to the various analysis tasks (such as pollution pattern mining, air quality inference, and forecasting). This not only helps researchers understand current research progress, but also offers insights and directions for future research.
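
To make the third challenge more concrete, the following is a minimal illustrative sketch of how wind can strengthen or weaken the coupling between two monitoring locations. It is not a method taken from the survey: the function name `wind_edge_weight`, the planar coordinates in kilometres, the convention that the wind direction is the direction the wind blows toward (measured counterclockwise from the +x axis), and the exponential distance decay with `scale_km` are all assumptions chosen for illustration.

```python
import math


def wind_edge_weight(src, dst, wind_speed, wind_dir_deg, scale_km=10.0):
    """Hypothetical influence weight of location `src` on location `dst`.

    src, dst      -- (x_km, y_km) planar coordinates (assumed, for simplicity)
    wind_speed    -- wind speed in any consistent unit
    wind_dir_deg  -- direction the wind blows TOWARD, in degrees
                     counterclockwise from the +x axis (assumed convention)
    scale_km      -- assumed length scale of the exponential distance decay
    """
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return 0.0

    # Unit vector of the wind direction.
    wx = math.cos(math.radians(wind_dir_deg))
    wy = math.sin(math.radians(wind_dir_deg))

    # Component of the wind velocity along the src -> dst direction.
    along = wind_speed * (wx * dx + wy * dy) / dist

    # Only downwind transport contributes; the effect decays with distance.
    return max(along, 0.0) * math.exp(-dist / scale_km)


if __name__ == "__main__":
    a, b = (0.0, 0.0), (10.0, 0.0)
    print(wind_edge_weight(a, b, wind_speed=5.0, wind_dir_deg=0.0))    # wind blows from a toward b
    print(wind_edge_weight(a, b, wind_speed=5.0, wind_dir_deg=180.0))  # wind blows away from b
```

When the wind blows from `a` toward `b`, the weight is positive; when the direction reverses, it drops to zero. Because the weight changes with every new wind observation, a fixed set of hand-engineered features captures it poorly, which is the kind of dynamic dependency the third challenge describes.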