Workload-Aware Scheduling for Data Analytics upon Heterogeneous Storage

Zhuzhong Qian, Yuan Gao, Mingtao Ji, Hui Peng, Peng Chen, Yibo Jin, Sanglu Lu
DOI: https://doi.org/10.1109/ISPA-BDCloud-SustainCom-SocialCom48970.2019.00088
2019-01-01
Abstract: A current trend in data centers is the wide deployment of heterogeneous storage devices, such as SSDs and HDDs, to meet the diverse demands of big data workloads. Because read performance differs considerably across storage devices, traditional concurrent data fetching easily leads to unbalanced use of the devices. The resulting stragglers in data fetching directly increase the overall latency of data analytics. To avoid such imbalance when fetching large volumes of data concurrently from storage devices, we formulate the Workload-Aware Scheduling problem for Heterogeneous storage devices (WASH), whose goal is to minimize the maximum data fetching time across parallel data analytics tasks. We design a randomized algorithm (rWASH) that selects a source device for each task based on carefully calculated probabilities, and whose result can be proved to concentrate around the optimum with high probability. Extensive experiments show that rWASH reduces the average data fetching time for tasks by up to 55% compared with state-of-the-art algorithms.
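To make the objective concrete, the following is a minimal illustrative sketch and not the authors' rWASH algorithm. The abstract does not say how rWASH computes its selection probabilities, so this sketch assumes a simple stand-in: each task picks one of the devices holding its data with probability proportional to that device's read bandwidth. The function names, the replica map, and the bandwidth figures are all hypothetical.

```python
import random
from collections import defaultdict


def fetch_makespan(assignment, task_sizes, bandwidth):
    """Return the maximum per-device fetching time.

    Each device's time is modeled (an assumption) as the total data
    assigned to it divided by its read bandwidth.
    """
    load = defaultdict(float)
    for task, device in assignment.items():
        load[device] += task_sizes[task]
    return max((load[d] / bandwidth[d] for d in load), default=0.0)


def randomized_select(replicas, bandwidth, rng):
    """Pick one source device per task at random.

    Assumption: probabilities are proportional to device bandwidth.
    The paper's actual probabilities are not given in the abstract.
    """
    assignment = {}
    for task, devices in replicas.items():
        weights = [bandwidth[d] for d in devices]
        assignment[task] = rng.choices(devices, weights=weights, k=1)[0]
    return assignment


if __name__ == "__main__":
    # Hypothetical setup: read bandwidth in MB/s, task input sizes in MB.
    bandwidth = {"ssd": 500.0, "hdd": 100.0}
    task_sizes = {f"t{i}": 128.0 for i in range(12)}
    replicas = {task: ["ssd", "hdd"] for task in task_sizes}

    rng = random.Random(0)
    assignment = randomized_select(replicas, bandwidth, rng)
    print("makespan (s):", fetch_makespan(assignment, task_sizes, bandwidth))
```

Under this simplified model, no assignment can finish faster than the total data size divided by the sum of the device bandwidths, which gives a quick sanity bound when comparing selection strategies.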