Performance Evaluation of Spark, Ray and MPI: A Case Study on Long Read Alignment Algorithm.
Kun Ran, Yingbo Cui, Zihang Wang, Shaoliang Peng
DOI: https://doi.org/10.1007/978-981-97-0798-0_4
2024-01-01
Abstract: The use of large-scale datasets is increasing across many fields as big data technology advances. With limited computing resources, traditional serial programs can no longer process such massive data efficiently. Furthermore, as Moore’s Law gradually loses its effect, improving program performance at the hardware level becomes increasingly challenging. Consequently, numerous parallel frameworks with distinct features and architectures have emerged, and selecting an appropriate one can improve the performance researchers obtain across various tasks. This paper evaluates three prominent parallel frameworks (Spark, Ray, and MPI), using minimap2, a CPU-based sequence alignment tool for third-generation (long-read) sequencing data, as the benchmark program. To evaluate the three frameworks, we devised a parallel algorithm for minimap2 and implemented parallel versions of it using Ray and MPI; for Spark, we selected IMOS as the Spark version of minimap2. The experiments used six real datasets and one simulated dataset to evaluate and compare speedup, efficiency, throughput, scalability, peak memory, latency, and load balance, and the results are discussed comprehensively. The findings show that MPI outperforms Apache Spark and Ray, achieving a maximum speedup of 104.019, an efficiency of 81.3%, a throughput of 33.510 MB/s, the lowest latency, and better load balance than Ray. However, MPI exhibits poor fault tolerance. Apache Spark delivers the second-best performance, with a speedup of 88.937, an efficiency of 69.5%, a throughput of 29.546 MB/s, low latency, and the best load balance. It also exhibits good fault tolerance and benefits from a mature ecosystem. Ray achieves a speedup of 76.828, an efficiency of 60.0%, and a throughput of 25.009 MB/s. It maintains good fault tolerance, but it shows large latency fluctuations and poorer load balance than the other two frameworks.
The source code and a comprehensive user manual for the Ray and MPI programs are available at https://github.com/Geehome/minimapR and https://github.com/Geehome/minimapM, respectively.
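The speedup and efficiency figures reported in the abstract can be related with a short calculation. The sketch below assumes the conventional definition of parallel efficiency, E = S / p, where S is the speedup and p is the number of parallel workers; the abstract does not state which definition the authors use, and it does not state p. The function name `implied_workers` is hypothetical and exists only for this illustration.

```python
# Reported (maximum speedup, efficiency) pairs taken from the abstract.
reported = {
    "MPI": (104.019, 0.813),
    "Spark": (88.937, 0.695),
    "Ray": (76.828, 0.600),
}


def implied_workers(speedup: float, efficiency: float) -> float:
    """Invert the assumed definition E = S / p to obtain p = S / E."""
    if efficiency <= 0:
        raise ValueError("efficiency must be positive")
    return speedup / efficiency


if __name__ == "__main__":
    for name, (s, e) in reported.items():
        # The worker count printed here is inferred, not stated in the source.
        print(f"{name}: implied p = {implied_workers(s, e):.1f}")
```

Under this assumed definition, all three pairs imply roughly the same worker count (about 128), which suggests the three maximum speedups were measured at the same degree of parallelism. This is an inference from the reported numbers, not a figure given in the abstract.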