Efficient data preprocessing, episode classification, and source apportionment of particle number concentrations

Chun-Sheng Liang, Hao Wu, Hai-Yan Li, Qiang Zhang, Zhanqing Li, Ke-Bin He
DOI: https://doi.org/10.1016/j.scitotenv.2020.140923
2020-11-01
Abstract:<p>Number concentration is an important index for measuring atmospheric particle pollution. However, tailored methods for data preprocessing and for characteristic and source analyses of particle number concentrations (PNC) are rare, and interpreting the data is time-consuming and inefficient. In this method-oriented study, we develop and investigate techniques that use flexible conditions, C++-optimized algorithms, and parallel computing in R (open-source software for statistics and graphics) to tackle these challenges. The data preprocessing methods include deletion of variables and observations, outlier removal, and interpolation of missing values (NA). Compared with previous methods, they clean the data better, retain more samples, and generate no new outliers after interpolation. In addition, automatic division of PNC pollution events based on relative values suits PNC properties and highlights the pollution characteristics related to sources and mechanisms. Basic functions of <em>k</em>-means clustering, Principal Component Analysis (PCA), Factor Analysis (FA), Positive Matrix Factorization (PMF), and a newly introduced model, Non-negative Matrix Factorization (NMF), were also tested and compared for analyzing PNC sources. Only PMF and NMF can identify coal heating and produce more explicable results; NMF apportions sources more distinctly and runs 11–28 times faster than PMF. Traffic is interannually stable in non-heating periods and always dominant. Coal heating's contribution has decreased by 40%–86% over the 5 most recent heating periods, reflecting the effectiveness of coal burning control.</p>
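The abstract notes that interpolating missing values should not create new outliers. As a minimal sketch only, and not the authors' R/C++ implementation, the Python example below assumes plain linear interpolation between the nearest valid neighbours. Each filled value then lies between the two observations that bound its gap, so it cannot fall outside the observed range. The function name `interpolate_missing` and the sample values are hypothetical.

```python
import math


def interpolate_missing(values):
    """Fill interior gaps (None or NaN) by linear interpolation.

    Assumption for illustration: gaps at the start or end of the series
    have no bounding observation on one side and are left as None.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # Normalise every missing marker to None and every observation to float.
    filled = [None if is_missing(v) else float(v) for v in values]

    # Indices of valid observations; each consecutive pair bounds one gap.
    valid = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(valid, valid[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled


# Hypothetical PNC readings (particles per cm^3) with missing entries.
pnc = [1200.0, None, None, 1800.0, float("nan"), 1500.0, None]
print(interpolate_missing(pnc))
```

Leaving leading and trailing gaps unfilled avoids extrapolating past the data, which is where a fill step would be most likely to introduce implausible values.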
environmental sciences