EXPERT: EXPloiting DRAM ERror Types to Improve the Effective Forecasting Coverage in the Field.

Xiangjun Peng, Zheng Huang, Alex Cantrell, Bi Hua Shu, Ke, Yi Li, Yu Li, Li Jiang, Qiang Xu, Ming-Chang Yang
DOI: https://doi.org/10.1109/dsn-s58398.2023.00022
2023-01-01
Abstract: DRAM failures, which are mostly caused by DRAM uncorrectable errors (UCEs), are one of the most critical factors affecting reliable services in computing systems. Prior work demonstrates the potential of machine learning techniques for forecasting DRAM UCEs. However, these approaches do not take into account that DRAM UCEs can be classified into different types. To this end, we obtain the first field dataset from a large datacenter of Alibaba Cloud with labels for the different UCE types. We then propose EXPERT, a design that exploits this information to improve the effective forecasting coverage of DRAM UCEs. Finally, we evaluate our approach against two state-of-the-art forecaster designs in the field; the results show that EXPERT improves effective coverage by up to 18.43% in terms of F1-Score.