Optimization of Spark Storage Solutions

Yunping Feng, Haopeng Chen
DOI: https://doi.org/10.1109/pic.2016.7949547
2016-01-01
Abstract: With the increasing demands of big data processing, distributed data processing frameworks like Hadoop and Spark are enjoying growing popularity. To make the best use of these frameworks, performance enhancement becomes a key concern. In this paper, we propose a cost-based optimization of Spark's storage solutions. In Spark, data is represented as RDDs, each of which has a storage level that indicates its storage mechanism. Our optimization is an offline method that consists of data sampling and training processes. Our evaluations show that the cost-based optimization is effective: it can improve the performance of a Spark application by up to 16%.
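To make the idea of choosing a storage level by cost concrete, the following is a minimal sketch and not the authors' cost model, which the abstract does not describe. It assumes that an offline sampling step has already produced run-time measurements for each candidate storage level, and it simply picks the level with the lowest mean cost. The function name `choose_storage_level` and the timings in the usage note are hypothetical; only the storage level names come from Spark itself.

```python
from statistics import mean

# Storage level names as defined by Spark. Everything else in this sketch
# (function name, input shape, selection rule) is an illustrative assumption.
STORAGE_LEVELS = (
    "MEMORY_ONLY",
    "MEMORY_ONLY_SER",
    "MEMORY_AND_DISK",
    "MEMORY_AND_DISK_SER",
    "DISK_ONLY",
)


def choose_storage_level(sample_timings):
    """Return the storage level with the lowest mean measured cost.

    sample_timings maps a storage level name to a list of run times
    (in seconds) observed on sampled data. Levels with no measurements
    are ignored.
    """
    unknown = set(sample_timings) - set(STORAGE_LEVELS)
    if unknown:
        raise ValueError(f"unknown storage levels: {sorted(unknown)}")

    # Average the sampled run times per level; skip empty measurement lists.
    costs = {
        level: mean(times)
        for level, times in sample_timings.items()
        if times
    }
    if not costs:
        raise ValueError("no measurements supplied")

    # Pick the level whose mean cost is smallest.
    return min(costs, key=costs.get)
```

For example, with hypothetical timings `{"MEMORY_ONLY": [4.0, 4.2], "MEMORY_ONLY_SER": [3.1, 3.3], "DISK_ONLY": [6.0]}`, the function selects `MEMORY_ONLY_SER`. In a Spark job, the chosen level would then be passed to `RDD.persist`.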