Using PySpark to accelerate batch data point rotation for paleogeographic reconstruction
Shuyan Xu, Linshu Hu, Haipeng Li, Mengjiao Qin, Sensen Wu, Zhenhong Du
a School of Earth Sciences, Zhejiang University, Hangzhou, People's Republic of China
b Zhejiang Provincial Key Laboratory of Geographic Information Science, Hangzhou, People's Republic of China
c The Earth System Big Data Platform of the School of Earth Sciences, Zhejiang University, Hangzhou, People's Republic of China
d Deep-time Digital Earth Research Center of Excellence (Suzhou), Kunshan, People's Republic of China
e Deep-time Digital Earth Research Center of Excellence, Hangzhou, People's Republic of China
f School of Safety Science and Emergency Management, Wuhan University of Technology, Wuhan, People's Republic of China
DOI: https://doi.org/10.1080/17538947.2024.2428699
IF: 4.606
2024-11-26
International Journal of Digital Earth
Abstract: Batch paleogeographic point rotation (BPPR) is an extensible, PySpark-based method for rotating data points in batches that accelerates the rotation step of paleogeographic reconstruction. Data point rotation is an important part of paleogeographic reconstruction and a significant tool for exploring the co-evolution of Earth and life. However, current point rotation techniques are slow when processing large volumes of paleogeographic data. This study therefore introduced a parallel-computing framework to construct BPPR. The method combines PySpark and PyGPlates to partition the points and process the partitions simultaneously in multiple threads. It rotated 232,277 fossil occurrences from the Cretaceous Period in the Paleobiology Database (PBDB) within 26 s, whereas an alternative GPlates method completed the same task within 96 s. The proposed method supports CSV, Excel, SHP, and other data formats, thereby avoiding the software switching that methods associated with GPlates may require. In experiments on synthetic and real paleontological datasets, BPPR proved to be nine times more efficient than GPlates when rotating 900,000 points. This efficiency gain significantly enhances data-driven paleogeographic analysis. The parallel strategy employed can be broadly applied to massive data analysis in geoscience.
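The partition-then-rotate idea described in the abstract can be illustrated with a minimal sketch. It is not the BPPR implementation: it uses only the Python standard library, with a thread pool standing in for PySpark's partitioned map and a hand-written Rodrigues rotation about a single fixed Euler pole standing in for PyGPlates' plate-model rotations. All function names (rotate_point, partition, rotate_partition, rotate_batch) and the (pole_lat, pole_lon, angle_deg) tuple are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def rotate_point(lat, lon, pole_lat, pole_lon, angle_deg):
    """Rotate one (lat, lon) point, in degrees, about an Euler pole."""
    # Convert the point and the pole to unit vectors on the sphere.
    phi, lam = math.radians(lat), math.radians(lon)
    v = (math.cos(phi) * math.cos(lam),
         math.cos(phi) * math.sin(lam),
         math.sin(phi))
    p_phi, p_lam = math.radians(pole_lat), math.radians(pole_lon)
    k = (math.cos(p_phi) * math.cos(p_lam),
         math.cos(p_phi) * math.sin(p_lam),
         math.sin(p_phi))
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    # Rodrigues' formula: v' = v*cos + (k x v)*sin + k*(k . v)*(1 - cos)
    dot = sum(a * b for a, b in zip(k, v))
    cross = (k[1] * v[2] - k[2] * v[1],
             k[2] * v[0] - k[0] * v[2],
             k[0] * v[1] - k[1] * v[0])
    r = tuple(v[i] * c + cross[i] * s + k[i] * dot * (1 - c) for i in range(3))
    # Clamp z to guard asin against floating-point overshoot.
    z = max(-1.0, min(1.0, r[2]))
    return math.degrees(math.asin(z)), math.degrees(math.atan2(r[1], r[0]))


def partition(points, n_parts):
    """Split points into at most n_parts contiguous chunks of similar size."""
    size = math.ceil(len(points) / n_parts) if points else 1
    return [points[i:i + size] for i in range(0, len(points), size)]


def rotate_partition(chunk, pole):
    """Rotate every point in one chunk; pole = (pole_lat, pole_lon, angle_deg)."""
    return [rotate_point(lat, lon, *pole) for lat, lon in chunk]


def rotate_batch(points, pole, n_parts=4):
    """Rotate all points chunk by chunk, preserving the input order."""
    chunks = partition(points, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(lambda ch: rotate_partition(ch, pole), chunks))
    return [pt for chunk in results for pt in chunk]
```

Because this sketch is pure Python, CPython's global interpreter lock prevents the threads from delivering a real speedup; it only shows the structure of splitting points into partitions, rotating each partition independently, and concatenating the results in order.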
geography, physical; remote sensing