An Introduction to PAKDD CUP 2020 Dataset

Yi Liu, Shujie Han, Cheng He, Jiongzhou Liu, Fan Xu, Tao Huang, Patrick P. C. Lee
DOI: https://doi.org/10.1007/978-981-15-7749-9_1
2020-01-01
Abstract: With the rapid development of cloud services, disk storage has played an important role in large-scale production cloud systems. Predicting imminent disk failures is critical for maintaining data reliability. Our vision is that it is important for researchers to contribute to the development of new techniques for accurate and robust disk failure prediction. If researchers can discover reasonable approaches for disk failure prediction in large-scale cloud systems, all IT and big data companies can benefit from such approaches to further enhance the robustness of their production cloud systems. With this vision in mind, we have published an open labeled dataset that spans a period of 18 months with a total of 220,000 hard drives collected from Alibaba Cloud. Our dataset is among the largest released in the community in terms of its scale and duration. To help readers better understand our dataset, we present our dataset generation process and conduct a preliminary analysis of its characteristics. Our open dataset has been adopted in the PAKDD2020 Alibaba AI Ops Competition, in which contestants proposed new disk failure prediction algorithms through the analysis and evaluation of the dataset.