Investigating Memory Failure Prediction Across CPU Architectures

Qiao Yu, Wengui Zhang, Min Zhou, Jialiang Yu, Zhenli Sheng, Jasmin Bogatinovski, Jorge Cardoso, Odej Kao
2024-06-08
Abstract: Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) highlight critical malfunctions in Dual Inline Memory Modules (DIMMs). Existing approaches primarily utilize Correctable Errors (CEs) to predict UEs, yet they typically neglect how these errors vary between different CPU architectures, especially in terms of Error Correction Code (ECC) applicability. In this paper, we investigate the correlation between CEs and UEs across different CPU architectures, including X86 and ARM. Our analysis identifies unique patterns of memory failure associated with each processor platform. Leveraging Machine Learning (ML) techniques on production datasets, we conduct memory failure prediction on different processor platforms, achieving up to 15% improvement in F1-score compared to the existing algorithm. Finally, an MLOps (Machine Learning Operations) framework is provided to consistently improve failure prediction in the production environment.
Hardware Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper primarily focuses on memory failure prediction in large data centers, particularly the prediction of Uncorrectable Errors (UEs) under different CPU architectures (such as X86 and ARM). Specifically, the paper attempts to solve the following key problems:

1. **Study of Memory Failure Correlation Across CPU Architectures**: Existing methods mainly utilize Correctable Errors (CEs) to predict UEs, but often overlook the variations of these errors across different CPU architectures, especially regarding the applicability of Error Correction Code (ECC). Therefore, the paper aims to investigate the correlation between CEs and UEs and identify memory failure patterns specific to different processor platforms, including X86 and ARM.
2. **Improving Prediction Accuracy**: By applying machine learning techniques to analyze production datasets, the paper attempts to improve the prediction accuracy of UEs. The research results show that memory failure prediction on different processor platforms achieved up to a 15% improvement in F1-score compared to existing algorithms.
3. **Building an MLOps Framework**: To ensure that the prediction models can continuously improve in actual production environments, the paper proposes an MLOps (Machine Learning Operations) framework. This framework can adapt to changes in server configurations, CPU architectures, memory types, and workloads, thereby continuously enhancing the performance of failure prediction throughout its lifecycle.

In summary, the goal of this paper is to fill the current research gap in cross-CPU architecture memory failure prediction and to improve prediction accuracy by developing targeted prediction algorithms and an MLOps framework, ultimately enhancing the reliability and service availability of data centers.
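
Since the headline result is reported as an F1-score, it may help to recall how that metric is computed for a binary failure predictor. The sketch below is illustrative only: the function name `f1_score`, the label convention (1 means a DIMM later experienced a UE) and the toy labels are assumptions made for this example, not data or code from the paper.

```python
def f1_score(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels.

    Assumed convention for this sketch: 1 = DIMM later had a UE, 0 = it did not.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failures correctly predicted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed failures

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels, not drawn from any production dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = f1_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so it penalizes a predictor that raises many false alarms as well as one that misses many failures. That makes it a common choice when failures are rare compared with healthy DIMMs.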