Octopus-DF: Unified DataFrame-based cross-platform data analytic system

Rong Gu, Jun Shi, Xiaofei Chen, Zhaokang Wang, Yang Che, Kai Zhang, Yihua Huang
DOI: https://doi.org/10.1016/j.parco.2021.102879
IF: 0.983
2022-01-01
Parallel Computing
Abstract: The DataFrame now serves as a core abstraction for modeling and implementing numerous machine learning and data analytic algorithms. Traditional data analytic programming languages, such as Python, provide the DataFrame programming model natively. In the big data era, there is a natural demand to bring the DataFrame model into distributed computing systems for convenient big data analysis, and various DataFrame libraries have therefore been implemented on Spark and Dask. However, these distributed computing systems expose parallelism semantics that are not straightforward for data analysts. In addition, a DataFrame-based algorithm may perform quite differently on different datasets and platforms, so it is difficult for data analysts to choose the platform that achieves the best performance for their programs. To address these problems, we build Octopus-DF, a unified DataFrame-based data analytic system. Octopus-DF integrates Pandas, Dask, and Spark as backend computing platforms and exposes the most widely used Pandas-style APIs to users. Because DataFrame computation performance plays a critical role in the efficiency of DataFrame-based data analytic algorithms, we design a set of DataFrame computation optimizations in two parts: (1) multiple indexing and DAG optimizations, and (2) a cross-platform scheduling strategy. Experimental results show that Octopus-DF outperforms the existing single platforms with an 11.72x speedup on average. Compared with existing platform combination strategies, Octopus-DF can achieve the optimal one. Moreover, the proposed optimizations effectively speed up the execution workflow.
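The cross-platform scheduling idea mentioned in the abstract can be illustrated with a small sketch. This is not Octopus-DF's actual scheduler: the `Backend` class, the linear cost model, and all numbers below are hypothetical, chosen only to show what picking a backend per operation from estimated costs might look like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """Hypothetical cost model for one backend (not taken from the paper)."""
    name: str
    startup_cost: float   # fixed overhead per operation, arbitrary units
    cost_per_row: float   # marginal cost per input row, arbitrary units

    def estimate(self, rows: int) -> float:
        return self.startup_cost + self.cost_per_row * rows


# Illustrative numbers only: a local engine with no startup cost but poor
# scaling, and distributed engines with higher startup but cheaper rows.
BACKENDS = [
    Backend("pandas", startup_cost=0.0, cost_per_row=1.0),
    Backend("dask", startup_cost=1_000.0, cost_per_row=0.5),
    Backend("spark", startup_cost=100_000.0, cost_per_row=0.1),
]


def choose_backend(rows: int, backends=BACKENDS) -> Backend:
    """Return the backend with the lowest estimated cost for this input size."""
    return min(backends, key=lambda b: b.estimate(rows))


def plan(workflow):
    """Assign a backend to each (operation, input_rows) step independently."""
    return [(op, choose_backend(rows).name) for op, rows in workflow]
```

Under these made-up costs, small inputs stay on the local engine and larger ones move to the distributed engines. A realistic scheduler would also have to account for the cost of moving data between platforms, which this sketch ignores.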