Corrigendum to: A Systematic Study of DDR4 DRAM Faults in the Field

Majed Valad Beigi, Yi Cao, Sudhanva Gurumurthi, Charles Recchia, Andrew Walton, Vilas Sridharan
2024-08-27
Abstract: This paper is a corrigendum to the paper by Beigi et al. published at HPCA 2023 (https://doi.org/10.1109/HPCA56546.2023.10071066). The HPCA paper presented a detailed field data analysis of faults observed at scale in DDR4 DRAM from two different memory vendors. This analysis included a breakdown of fault patterns or modes. Upon further study of the data, we found a bug in how we decoded errors based on the logged row-bank-column address. Specifically, we found that some errors that occurred in one column were misinterpreted as occurring in two non-adjacent columns. As a result of this, some single-bit faults were misclassified as partial-row faults (i.e., two-bit faults). Similarly, some single-column faults were misclassified as two-column faults. The result of these misclassification errors is that the proportion of single-bit faults is higher than reported in the paper, with a commensurate reduction in the fraction of certain types of multi-bit faults. These misclassifications also slightly change the Failure In Time (FIT) per DRAM device values presented in the original paper. In this corrigendum, we provide an updated version of the relevant tables and figures and point out the corresponding page numbers and references in the original paper that they replace.
Hardware Architecture
What problem does this paper attempt to address?
This paper corrects and updates fault misclassifications in the previously published DDR4 DRAM fault analysis. Specifically, an address-decoding error in the original paper caused some single-bit faults to be misclassified as partial-row faults (i.e., two-bit faults) and some single-column faults to be misclassified as two-column faults. This skewed the reported proportions of fault modes, in particular those of single-bit and multi-bit faults.

### Main problems and corrections

1. **Impact of misclassification**:
   - **Single-bit faults**: underestimated in the original paper; the actual proportion is higher.
   - **Multi-bit faults**: overestimated in the original paper; the actual proportion is lower.
2. **Change in failure rate (FIT)**:
   - Because of the misclassification, the FIT values per DRAM device reported in the original paper also change slightly.
3. **Specific corrections**:
   - **Abstract**: the proportion of multi-bit faults is updated to approximately 23%.
   - **Introduction**: the proportions of intermittent and permanent failure rates are modified, and the description of multi-bit faults is updated.
   - **Tables and figures**: multiple tables (such as Table I, Table II, Table V) and figures (such as Figure 2, Figure 6) are updated to reflect the correct fault classification and failure rates.
   - **Conclusion**: the proportion of multi-bit faults is updated here as well.

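To make the misclassification mechanism concrete, the following is a minimal toy sketch, not the authors' decoder or classifier. The `(row, column)` address layout, the category rules in `classify`, and the column-flipping behaviour of `buggy_decode` are all hypothetical assumptions chosen only to reproduce the effect described above: an error in one column appearing in two non-adjacent columns.

```python
from typing import Iterable, Tuple

Address = Tuple[int, int]  # (row, column) within one bank; hypothetical layout


def classify(addresses: Iterable[Address]) -> str:
    """Toy fault-mode classifier; the categories are illustrative only."""
    addrs = set(addresses)
    rows = {row for row, _ in addrs}
    cols = {col for _, col in addrs}
    if len(addrs) == 1:
        return "single-bit"
    if len(rows) == 1:
        return "partial-row"
    if len(cols) == 1:
        return "single-column"
    if len(cols) == 2:
        return "two-column"
    return "other"


def buggy_decode(addr: Address, seq: int) -> Address:
    """Hypothetical decode bug: every other logged error gets a wrong column."""
    row, col = addr
    return (row, col ^ 0x40) if seq % 2 else (row, col)


# One faulty bit that is logged four times at the same address.
single_bit_log = [(10, 3)] * 4
# One faulty column that produces errors in four different rows.
single_column_log = [(row, 3) for row in range(1, 5)]

for name, log in [("single-bit", single_bit_log), ("single-column", single_column_log)]:
    correct = classify(log)
    decoded = classify(buggy_decode(a, i) for i, a in enumerate(log))
    print(f"{name}: correct={correct}, with decode bug={decoded}")
```

Under these assumptions, the repeated single-bit error is reported as a partial-row fault and the single-column fault as a two-column fault, which is the direction of the shift the corrigendum describes.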
### Formula representation

The key quantities affected by these changes are:

- **Proportion of single-bit faults**: \[ P_{\text{single-bit}} = \frac{\text{Number of single-bit faults}}{\text{Total number of faults}} \]
- **Proportion of multi-bit faults**: \[ P_{\text{multi-bit}} = \frac{\text{Number of multi-bit faults}}{\text{Total number of faults}} \]
- **Failure rate (FIT)**, the number of faults per $10^{9}$ device-hours of operation: \[ \text{FIT} = \frac{\text{Number of faults}}{\text{Number of device operating hours}} \times 10^{9} \]

With these corrections, the corrigendum provides a more accurate analysis of DDR4 DRAM fault modes, so that subsequent research and applications can build on more reliable data.
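As a small worked example of these quantities, the sketch below computes fault-mode proportions and a FIT rate, using the standard definition of FIT as failures per 10^9 device-hours. The function names and all numbers are made up for illustration and are not taken from the paper. For simplicity it also assumes that every fault that is not single-bit counts as multi-bit.

```python
def proportion(count: int, total: int) -> float:
    """Fraction of all observed faults that fall into one fault mode."""
    if total <= 0:
        raise ValueError("total number of faults must be positive")
    return count / total


def fit_rate(faults: int, device_hours: float) -> float:
    """Failures In Time: faults per 10**9 device-hours of operation."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return faults * 1e9 / device_hours


# Made-up numbers for illustration only; they are NOT taken from the paper.
total_faults = 200
single_bit_before, single_bit_after = 140, 150  # 10 faults reclassified

p_single_before = proportion(single_bit_before, total_faults)
p_single_after = proportion(single_bit_after, total_faults)
# Simplifying assumption: everything that is not single-bit is multi-bit.
p_multi_after = proportion(total_faults - single_bit_after, total_faults)

print(p_single_before, p_single_after, p_multi_after)
print(fit_rate(faults=5, device_hours=1e8))
```

Reclassifying faults from multi-bit to single-bit leaves the total unchanged, so the two proportions move in opposite directions by the same amount. A per-mode FIT value changes only for the modes whose fault counts change.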