PAC: A monitoring framework for performance analysis of compression algorithms in Spark

Changpeng Zhu, Bo Han, Gang Li
DOI: https://doi.org/10.1016/j.future.2024.02.009
IF: 7.307
2024-02-01
Future Generation Computer Systems
Abstract: In Spark, a massive amount of intermediate data inevitably leads to excessive I/O overhead. To mitigate this issue, Spark incorporates four compression algorithms that reduce the size of the data for better performance. However, compression and decompression constitute only a portion of the overall logical flow of a Spark application, which suggests potentially considerable interaction between compression algorithms and Spark applications with respect to performance. Consequently, identifying the factors that significantly affect the performance of compression algorithms in Spark, and then determining the actual performance benefit these algorithms bring to Spark applications, remains a significant challenge. To address this challenge, this paper presents PAC, a monitoring framework for in-depth, systematic performance analysis of compression algorithms in Spark. As the first monitoring framework of its kind, PAC is built on top of Spark core and collaborates with multiple monitors to collect various types of compressor performance metrics, which its data transformer correlates and integrates into structured tuples. This makes it easier to diagnose the factors that significantly influence the performance of compression algorithms in Spark. Using PAC, our experiments reveal new determinants beyond the traditional ones: the input/output data sizes and types of compression/decompression invocations, the CPU consumption of compressing massive amounts of data, and hardware utilization. Moreover, the experiments demonstrate that ZSTD is more susceptible to performance issues when compressing and decompressing small data, even when the overall input and output data are huge. In terms of performance, LZ4 serves as a viable alternative to ZSTD. These findings not only help researchers and developers make more informed decisions when configuring and tuning Spark execution environments but also support the continued optimization of compression algorithms for Spark.
computer science, theory & methods