JeCache: Just-Enough Data Caching with Just-in-Time Prefetching for Big Data Applications.

Yifeng Luo, Jia Shi, Shuigeng Zhou
DOI: https://doi.org/10.1109/icdcs.2017.268
2017-01-01
Abstract: Big data clusters introduce an intermediate cache layer between the computing frameworks and the underlying distributed file systems, so that upper-level applications and end users can efficiently access big datasets in cache and share them among different computing frameworks. Because caches are shared by multiple applications and end users, directly applying existing on-demand caching strategies results in intense conflicts when big datasets are cached as a whole. Moreover, big data applications usually involve massive numbers of file scans, so cached-in data blocks may have little chance of being accessed before they are cached out to make way for other on-demand data blocks. Thus, it is unwise to cache data blocks long before they are actually accessed. In this paper, we propose a novel just-enough big data caching scheme with just-in-time block prefetching to improve the cache effectiveness of big data clusters. With just-in-time block prefetching, a block is cached in just before a task begins to process it, rather than being cached in along with the other blocks of the same dataset. We monitor block accesses to measure the average processing time of data blocks, and then estimate the minimal number of blocks that should be kept in cache for a big dataset, so that the speed of data processing matches that of data prefetching and each upper-level task can obtain its input blocks from cache just in time. Our experimental results show that the proposed caching method curbs excessive demand for cache resources by big data applications while providing the same performance improvement as caching all data blocks.
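The sizing idea in the abstract can be read as a rate-matching calculation. The following is a minimal sketch under stated assumptions, not the paper's actual estimator: it assumes a hypothetical model in which each running task holds the block it is processing, and additionally needs enough prefetched blocks to cover the time a fetch takes. The names `BlockTimer`, `min_cached_blocks`, `fetch_s`, and the formula itself are illustrative and do not come from the paper.

```python
import math
from collections import deque


class BlockTimer:
    """Sliding-window average of observed per-block processing times."""

    def __init__(self, window=32):
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def record(self, seconds):
        self.samples.append(seconds)

    def average(self):
        if not self.samples:
            raise ValueError("no processing times recorded yet")
        return sum(self.samples) / len(self.samples)


def min_cached_blocks(avg_process_s, fetch_s, tasks):
    """Hypothetical estimate of the blocks to keep cached for one dataset.

    Assumed model: every task keeps its current block in cache, plus
    ceil(fetch_s / avg_process_s) prefetched blocks so that the next
    block has arrived by the time the current one is finished.
    """
    if avg_process_s <= 0 or fetch_s < 0 or tasks < 1:
        raise ValueError("invalid timing or task count")
    lookahead = math.ceil(fetch_s / avg_process_s)
    return tasks * (1 + lookahead)
```

Under these assumptions, four concurrent tasks that each take 3 s per block with a 1.5 s fetch time would need 8 cached blocks, regardless of how many blocks the dataset contains. When processing is faster than fetching, the lookahead grows accordingly.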