Improving 3D DRAM Fault Tolerance Through Weak Cell Aware Error Correction.
Hao Wang, Kai Zhao, Minjie Lv, Xuebin Zhang, Hongbin Sun, Tong Zhang
DOI: https://doi.org/10.1109/tc.2016.2621758
IF: 3.183
2016-01-01
IEEE Transactions on Computers
Abstract: Although emerging 3D DRAM products can significantly improve computing system performance, their relatively high cost is one of the most critical obstacles to wide real-life adoption. Intuitively, strong memory fault tolerance can be leveraged to reduce the fabrication cost of DRAM dies, and the total cost falls if the fabrication cost saving offsets the cost overhead of the fault tolerance. Nevertheless, this simple concept is a practically viable option only for 3D DRAM because: (1) The stacked logic die can implement memory fault tolerance entirely inside the 3D DRAM chip, obviating any changes to the host CPUs and CPU-DRAM interfaces. (2) With total ownership of both the logic die and the DRAM dies inside 3D DRAM chips, DRAM manufacturers can fully exploit the potential to truly minimize the 3D DRAM bit cost. Following this intuition, we developed a 3D DRAM fault tolerance design strategy. It achieves very strong tolerance to weak DRAM cells at very small redundancy and latency overhead. The key is to cohesively leverage the detectability of weak cells and the runtime configurability of error correction code (ECC) decoding. In addition, this design strategy gracefully tolerates inaccuracy in weak cell detection (e.g., missed detections and false detections). We carried out a thorough mathematical analysis, and the results show that, under a redundancy overhead of 1:8 (the same as today's ECC DIMMs), this design strategy can tolerate weak cell rates as high as 10⁻⁴ and 6 × 10⁻⁵ if 100 and 90 percent of all the weak cells are known a priori, respectively. Using Micron's hybrid memory cube (HMC) 3D DRAM chips as the test vehicle, we evaluated the implementation cost, and the results show that it occupies less than 0.4 mm² (45 nm node) on the logic die. Using CPU and DRAM simulators, we further carried out simulations over a variety of computing benchmarks, and the results show that this design solution incurs less than 2 percent performance degradation on average.
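
To give a rough sense of why a weak cell rate near 10⁻⁴ is demanding, the sketch below computes the probability that a single codeword contains more weak cells than a code can correct. It is not the paper's analysis: it assumes weak cells occur independently with probability p per bit, and it uses a 72-bit word (64 data bits plus 8 check bits, matching the 1:8 overhead of ECC DIMMs) with single-error correction purely as an illustrative baseline. The function name and parameters are hypothetical.

```python
from math import comb


def prob_more_than_t_weak(n: int, p: float, t: int) -> float:
    """Probability that an n-bit codeword holds more than t weak cells.

    Assumption (not from the paper): each bit is weak independently
    with probability p, so the weak-cell count is Binomial(n, p).
    The upper tail is summed directly to avoid the cancellation error
    of computing 1 minus the lower tail.
    """
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(t + 1, n + 1)
    )


if __name__ == "__main__":
    n = 72  # illustrative: 64 data bits + 8 check bits (1:8 overhead)
    for p in (1e-4, 6e-5):
        # t = 1 models a code that corrects one error per word.
        print(f"p={p:g}: P(more than 1 weak cell in {n} bits) = "
              f"{prob_more_than_t_weak(n, p, 1):.3e}")
```

Under these assumptions, the per-word probability of exceeding single-error correction at p = 10⁻⁴ is on the order of 10⁻⁵, which is far from negligible across the very large number of words in a memory device. This is consistent with the abstract's point that plain ECC is not enough and that knowledge of weak cell locations has to be exploited.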