Reference-distance Eviction and Prefetching for Cache Management in Spark

Tiago B. G. Perez, Xiaobo Zhou, Dazhao Cheng
DOI: https://doi.org/10.1145/3225058.3225087
2018-01-01
Abstract: Optimizing memory cache usage is vital for the performance of in-memory data-parallel frameworks such as Spark. Current data-analytic frameworks use the popular Least Recently Used (LRU) policy, which does not take advantage of the data dependency information available in the application's directed acyclic graph (DAG). Recent research in dependency-aware caching, notably MemTune and Least Reference Count (LRC), has made important improvements to close this gap. However, these approaches do not fully leverage the DAG structure, which imparts information such as the time-spatial distribution of data references across the workflow, to further improve cache hit ratio and application runtime. In this paper, we propose and develop a new cache management policy, Most Reference Distance (MRD), that uses DAG information to optimize both eviction and prefetching of data. MRD takes into account the relative stage distance of each data block reference in the application workflow, evicting the data in the cache that is furthest from being referenced and least likely to be used, while aggressively prefetching the nearest data that is most likely to be needed, and in doing so better overlaps computation with I/O time. Our experiments with a Spark implementation, using popular benchmarking workloads, show that MRD has low overhead and improves performance by an average of 53% compared to LRU, and by up to 68% and 45% compared to MemTune and LRC, respectively. It works best for I/O-intensive workloads.
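The eviction and prefetching rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Spark implementation: the function names, the `references` mapping from block name to the stage indices that read it, and the choice to treat a block with no remaining references as infinitely distant are all assumptions made for illustration.

```python
import math

# Hypothetical DAG summary: block name -> stage indices that read the block.
references = {
    "rdd_a": [1, 5],
    "rdd_b": [2],
    "rdd_c": [0, 3],
    "rdd_d": [4],
    "rdd_e": [1],
}


def reference_distance(block, current_stage, references):
    """Stages until `block` is next read; infinity if it is never read again."""
    future = [s for s in references.get(block, ()) if s >= current_stage]
    return min(future) - current_stage if future else math.inf


def choose_victim(cached, current_stage, references):
    """Pick the cached block whose next reference is furthest away."""
    return max(cached, key=lambda b: reference_distance(b, current_stage, references))


def choose_prefetch(uncached, current_stage, references):
    """Pick the uncached block whose next reference is nearest, or None."""
    candidates = [
        b for b in uncached
        if reference_distance(b, current_stage, references) < math.inf
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda b: reference_distance(b, current_stage, references))


if __name__ == "__main__":
    stage = 2
    # Distances at stage 2: rdd_a = 3, rdd_b = 0, rdd_c = 1.
    print(choose_victim(["rdd_a", "rdd_b", "rdd_c"], stage, references))  # rdd_a
    # rdd_d is read at stage 4 (distance 2); rdd_e is never read again.
    print(choose_prefetch(["rdd_d", "rdd_e"], stage, references))  # rdd_d
```

An LRU policy at the same point would consider only when each block was last accessed, whereas this sketch ranks blocks purely by how soon the DAG says they will be read next.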