Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor

Christopher Weaver, Joel Emer, Shubhendu S. Mukherjee, Steven K. Reinhardt
DOI: https://doi.org/10.1145/1028176.1006723
2004-03-02
ACM SIGARCH Computer Architecture News
Abstract: Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor counts in future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a device makes that device more likely to encounter a fault. Hence, maintaining processor error rates at acceptable levels will require increasing design effort. This paper proposes two simple approaches to reduce error rates and evaluates their application to a microprocessor instruction queue. The first technique reduces the time instructions sit in vulnerable storage structures by selectively squashing instructions when long delays are encountered. A fault is less likely to cause an error if the structure it affects does not contain valid instructions. We introduce a new metric, MITF (Mean Instructions To Failure), to capture the trade-off between performance and reliability introduced by this approach. The second technique addresses false detected errors. In the absence of a fault detection mechanism, such errors would not have affected the final outcome of a program. For example, a fault affecting the result of a dynamically dead instruction would not change the final program output, but could still be flagged by the hardware as an error. To avoid signalling such false errors, we modify a pipeline's error detection logic to mark affected instructions and data as possibly incorrect rather than immediately signaling an error. Then, we signal an error only if we determine later that the possibly incorrect value could have affected the program's output.
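As a rough illustration of the trade-off that MITF is meant to capture, the sketch below assumes MITF can be modeled as committed instructions per second multiplied by mean time to failure, with mean time to failure taken as the reciprocal of the raw fault rate times the fraction of faults that become errors. The function name, its parameters, and every number are hypothetical and are not taken from the paper.

```python
def mitf(ipc, freq_hz, raw_fault_rate_per_s, vulnerability):
    """Mean Instructions To Failure under an assumed model:
    MITF = IPC * frequency * MTTF, where
    MTTF = 1 / (raw fault rate * fraction of faults that cause errors).
    All parameter names are illustrative assumptions.
    """
    error_rate = raw_fault_rate_per_s * vulnerability
    if error_rate <= 0:
        raise ValueError("error rate must be positive")
    mttf_s = 1.0 / error_rate
    return ipc * freq_hz * mttf_s


# Hypothetical design points: squashing instructions on long delays
# costs some IPC but leaves fewer valid instructions exposed to faults.
baseline = mitf(ipc=1.20, freq_hz=2.0e9,
                raw_fault_rate_per_s=1.0e-9, vulnerability=0.30)
squashing = mitf(ipc=1.15, freq_hz=2.0e9,
                 raw_fault_rate_per_s=1.0e-9, vulnerability=0.20)

# Under this model, squashing pays off only if the relative gain in
# reliability exceeds the relative loss in IPC.
improvement = squashing / baseline
print(f"MITF improvement factor: {improvement:.4f}")
```

With these made-up inputs, a roughly 4% IPC loss is outweighed by a one-third reduction in vulnerability, so MITF rises. Reversing the balance (a large IPC loss for a small reliability gain) would lower MITF, which is the kind of comparison such a metric makes possible.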