Chukonu: A Fully-Featured High-Performance Big Data Framework that Integrates a Native Compute Engine into Spark

Bowen Yu, Guanyu Feng, Huanqi Cao, Xiaohan Li, Zhenbo Sun, Haojie Wang, Xiaowei Zhu, Weimin Zheng, Wenguang Chen
DOI: https://doi.org/10.14778/3503585.3503596
IF: 2.5
2021-01-01
Proceedings of the VLDB Endowment
Abstract: Apache Spark is a widely deployed big data analytics framework that offers attractive features such as resiliency, load balancing, and a rich ecosystem. However, there is still plenty of room for improvement in its performance. Although a data-parallel system written in a native programming language significantly improves performance, it may need to re-implement many of Spark's functionalities to become a full-featured system. It is desirable for a native big data system to implement only the compute engine in a native language, to ensure high efficiency, and to reuse the other mature features provided by Spark rather than re-implement everything. But the interaction between the JVM and the native world risks becoming a bottleneck. This paper proposes Chukonu, a native big data framework that reuses critical big data features provided by Spark. Owing to our novel DAG-splitting approach, the potential Spark integration overhead is alleviated, and Chukonu even outperforms existing pure-native big data frameworks. Chukonu splits DAG programs into run-time parts and compile-time parts. The run-time parts are delegated to Spark, which offloads the complexity of implementing those features. The compile-time parts are natively compiled. We propose a series of optimization techniques to be applied to the compile-time parts, such as operator fusion, vectorization, and compaction, to significantly reduce the Spark integration overhead. Our evaluation shows that Chukonu achieves a speedup of up to 71.58x (geometric mean 6.09x) over Apache Spark, and up to 7.20x (geometric mean 2.30x) over pure-native frameworks, on six commonly used big data applications. By translating the physical plan produced by SparkSQL into Chukonu programs, Chukonu accelerates SparkSQL's TPC-DS performance by 2.29x.
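Operator fusion, one of the optimizations named in the abstract, can be illustrated with a minimal sketch. The code below is not Chukonu's implementation (which compiles the fused operators natively); it is a plain Python illustration under the assumption of a simple map, filter, map pipeline, and the function names `unfused` and `fused` are hypothetical. It shows only the general idea: chained operators are collapsed into one pass over the data so that no intermediate collection is materialized between them.

```python
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def unfused(records: Iterable[A],
            parse: Callable[[A], B],
            keep: Callable[[B], bool],
            project: Callable[[B], C]) -> List[C]:
    """Run each operator as its own pass, materializing every intermediate list."""
    parsed = [parse(r) for r in records]      # pass 1: map
    kept = [p for p in parsed if keep(p)]     # pass 2: filter
    return [project(k) for k in kept]         # pass 3: map


def fused(records: Iterable[A],
          parse: Callable[[A], B],
          keep: Callable[[B], bool],
          project: Callable[[B], C]) -> List[C]:
    """Apply all three operators per record in a single pass (hypothetical fused form)."""
    out: List[C] = []
    for r in records:
        p = parse(r)
        if keep(p):
            out.append(project(p))
    return out


if __name__ == "__main__":
    data = ["1", "2", "3", "4"]
    print(fused(data, int, lambda x: x % 2 == 0, lambda x: x * 10))
```

Both functions return the same result; the fused form simply avoids the two intermediate lists. In a system that crosses a JVM/native boundary, fewer operator boundaries would plausibly also mean fewer boundary crossings, which is the kind of overhead the abstract says these optimizations target.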