Understanding Silent Data Corruption in Processors for Mitigating Its Effects

Shaobu Wang, Guangyan Zhang, Junyu Wei, Yang Wang, Jiesheng Wu, Qingchao Luo
DOI: https://doi.org/10.1145/3690825
IF: 1.444
2024-01-01
ACM Transactions on Architecture and Code Optimization
Abstract: Silent Data Corruption (SDC) in processors can lead to various application-level issues, such as incorrect calculations and even data loss. Because traditional techniques are not effective at detecting these errors, problems caused by SDCs in processors are very hard to address. For the same reason, knowledge about these SDCs in the wild is limited. In this paper, we conduct an extensive study of CPU SDCs in a large production CPU population encompassing over one million processors. In addition to collecting overall statistics, we perform a detailed study to understand 1) whether certain processor features are particularly vulnerable, and their potential impact on applications; 2) the reproducibility of CPU SDCs and the triggering conditions (e.g., temperature) of the less reproducible ones; and 3) the challenges of mitigating and handling CPU SDCs. We further investigate the implications of these observations for SDC fault models, SDC mitigation strategies, and future research directions. In addition, we design an efficient SDC mitigation approach called Farron, which uses prioritized testing to detect highly reproducible SDCs and temperature control to mitigate less reproducible SDCs. Our experimental results indicate that Farron achieves better coverage of CPU SDCs with lower overall overhead than the baseline used in Alibaba Cloud, which demonstrates that our observations can assist in SDC mitigation.