RDDShare: Reusing Results of Spark RDD

Huang Chao-qiang, Yang Shu-qiang, Tang Jian-chao, Yan Zhou
DOI: https://doi.org/10.1109/dsc.2016.80
2016-01-01
Abstract: In recent years, Spark has become a popular platform for big data processing. For a single user, Spark provides the cache method to share results between the jobs of a single application. When multiple users access Spark concurrently, the submitted applications may contain identical computations; however, Spark does not provide a way to share computing results between applications. In traditional databases, one way to optimize query performance is to cache part or all of a query's results and share them with other requests. Based on this idea, we propose RDDShare, a system built on Spark SQL that manages the cache and reuses the results. Simulation experiments show that RDDShare can significantly improve the query performance of Spark SQL.