FvRS: Efficiently Identifying Performance-Critical Data for Improving Performance of Big Data Processing

Gaoxiang Xu, Zhipeng Tan, Dan Feng, Laurence T. Yang, Wei Zhou, Xinyan Zhang, Yang Zhang, Jie Xu
DOI: https://doi.org/10.1016/j.future.2018.09.003
IF: 7.307
2019-01-01
Future Generation Computer Systems
Abstract: Hybrid storage is widely used in big data processing because it provides large storage capacity and high access speed in an economical manner. Performance-critical data are usually stored on SSD to obtain the greatest performance benefit at the lowest storage cost. The conventional scheme identifies performance-critical data based on access hotness, but it does not consider the data's I/O cost and may store low-cost data on SSD, wasting SSD capacity. A recently proposed scheme determines performance-critical data based on both access hotness and I/O cost. However, it fails to evaluate I/O cost accurately and thus still places much low-cost data on SSD. In this paper, we propose a sequentiality-aware identification scheme for performance-critical data, called FvRS, which boosts the accuracy of I/O cost evaluation by exploiting the data's access sequentiality. The key idea is to evaluate the data's I/O cost based on both request size and access sequentiality. By properly identifying high-cost hot data, FvRS maximizes the utilization of SSD to improve system performance. In addition, FvRS maintains performance-critical data in a real-time table to reduce the identification overhead. We have implemented FvRS in a hybrid storage system in Linux. Extensive evaluations using three real-workload traces and the well-known Postmark benchmark demonstrate the accuracy and efficiency of FvRS. Compared with the state-of-the-art schemes, namely hotness-based identification and cost-based identification, FvRS reduces I/O response time by 10.3% to 45.6% and 16.3% to 25.1%, respectively. © 2018 Elsevier B.V. All rights reserved.
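The key idea stated in the abstract, scoring data by an I/O cost that depends on both request size and access sequentiality, can be illustrated with a minimal sketch. This is not the FvRS implementation: the cost model, the constants (POSITIONING_MS, TRANSFER_MS_PER_BLOCK, EXTENT_BLOCKS), and the names Request, io_cost_ms, and rank_extents are all hypothetical and chosen only for illustration. The sketch assumes that a request which does not continue the previous one pays a fixed HDD positioning penalty, and that summing the estimated cost per extent combines hotness (how often the extent is accessed) with per-access cost.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Hypothetical HDD parameters, not taken from the paper.
POSITIONING_MS = 8.0           # assumed seek + rotational delay for a non-sequential access
TRANSFER_MS_PER_BLOCK = 0.03   # assumed transfer time per block
EXTENT_BLOCKS = 1024           # assumed granularity at which data is placed on SSD


@dataclass
class Request:
    start_block: int
    num_blocks: int


def io_cost_ms(req: Request, prev_end: Optional[int]) -> float:
    """Estimate HDD service time from request size and sequentiality."""
    transfer = req.num_blocks * TRANSFER_MS_PER_BLOCK
    # A request is treated as sequential if it starts where the previous one ended.
    is_sequential = prev_end is not None and req.start_block == prev_end
    return transfer if is_sequential else transfer + POSITIONING_MS


def rank_extents(trace):
    """Return extent ids ordered from highest to lowest accumulated I/O cost."""
    score = defaultdict(float)
    prev_end = None
    for req in trace:
        score[req.start_block // EXTENT_BLOCKS] += io_cost_ms(req, prev_end)
        prev_end = req.start_block + req.num_blocks
    return sorted(score, key=score.get, reverse=True)


# Two extents with the same hotness (four accesses each):
# extent 0 is read sequentially, extent 5 with small random requests.
trace = [Request(0, 64), Request(64, 64), Request(128, 64), Request(192, 64),
         Request(5120, 8), Request(5500, 8), Request(5200, 8), Request(5900, 8)]
ranking = rank_extents(trace)
```

Under these assumed parameters, the randomly accessed extent accumulates a higher cost than the sequentially accessed one even though both are equally hot, so a hotness-only ranking could not tell them apart while a cost-aware ranking would prefer the random extent for SSD placement.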