A parallel feature selection method based on NMI-XGBoost and distance correlation for typhoon trajectory prediction
Baiyou Qiao, Jiaqi Wu, Rui Wang, Yuanqing Hao, Peirui Wang, Donghong Han, Gang Wu
DOI: https://doi.org/10.1007/s11227-023-05863-3
IF: 3.3
2024-01-23
The Journal of Supercomputing
Abstract: Typhoon trajectory data involve many factors, including atmospheric, oceanic, and physical factors. The data are high-dimensional and exhibit strong spatio-temporal and nonlinear correlations, which makes typhoon trajectory prediction difficult. Feature selection is therefore an important means of choosing appropriate predictors, reducing the dimensionality of the data, and improving the performance and accuracy of prediction methods. However, existing feature selection methods based on linear correlation analysis cannot adequately capture nonlinear correlations between features, which lowers selection accuracy, while methods based on nonlinear correlation analysis are computationally expensive, which limits the timeliness of feature selection. To address both problems, we propose NX-Spark-DC, a parallel feature selection method for typhoon trajectory data built on the Spark platform. The method first filters out redundant features using normalized mutual information (NMI), then eliminates uninformative features using an XGBoost model, thereby reducing the dimensionality of the data. On this basis, an improved Spark-based parallel distance correlation algorithm (Spark-DC) selects the feature combinations with strong correlation. Experimental results show that NX-Spark-DC achieves high execution efficiency and accuracy and significantly outperforms existing methods.
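To illustrate the nonlinear dependence measure that the abstract builds on, the following is a minimal single-process sketch of the standard sample distance correlation between two one-dimensional samples. It is not the authors' Spark-DC algorithm: the function names (`distance_correlation`, `_centered_distances`) are hypothetical, the code runs serially in pure Python, and the O(n^2) pairwise-distance step is shown only because it is the part a parallel version would presumably need to distribute.

```python
import math


def _centered_distances(values):
    """Return the double-centered pairwise distance matrix of a 1-D sample."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation between two equally long 1-D samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    # Squared sample distance covariance and distance variances.
    dcov2_xy = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar2_x = sum(v * v for row in a for v in row) / n**2
    dvar2_y = sum(v * v for row in b for v in row) / n**2
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(dcov2_xy, 0.0) / denom)


if __name__ == "__main__":
    x = [-2.0, -1.0, 0.0, 1.0, 2.0]
    y = [v * v for v in x]  # purely nonlinear relation; Pearson correlation is 0
    print(distance_correlation(x, y))
```

In this toy input the Pearson correlation between `x` and `y` is zero, yet the distance correlation is clearly positive, which is the kind of nonlinear relationship that linear correlation analysis misses.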
computer science, theory & methods; engineering, electrical & electronic; hardware & architecture