A Large-Scale Study of I/O Workload's Impact on Disk Failure.

Song Wu, Yusheng Yi, Jiang Xiao, Hai Jin, Mao Ye
DOI: https://doi.org/10.1109/ACCESS.2018.2866522
IF: 3.9
2018-01-01
IEEE Access
Abstract: In large-scale data centers, disk failure is the norm rather than the exception. Frequent disk failures noticeably hurt user experience and, in the worst case, make data unavailable. Previous research from both industry and academia has studied the causes of disk failure; however, little is known about the intrinsic relation between failed disks and their I/O workload. In this paper, we collect and investigate about four billion drive-hours of I/O traces from over 500 000 disks in Tencent's data centers. Our focus is first to explore the key characteristics of I/O workload that influence disk reliability. We further present the impact of these I/O workload features on the lifespan of disks and uncover the root causes. Finally, we introduce a new metric to accurately identify the "dangerous" I/O workload that is extremely harmful to disk health. To the best of our knowledge, this research is the first in-depth analysis of the I/O workload's impact on disk reliability, and it opens up a new dimension for I/O scheduling policy in data centers.