On the Use of DRAM with Unrepaired Weak Cells in Computing Systems
Hao Wang, Yin Li, Xuebin Zhang, Xiaoqing Zhao, Hongbin Sun, Tong Zhang
DOI: https://doi.org/10.1145/2989081.2989108
2016-01-01
Abstract: In current practice, DRAM manufacturers apply redundancy repair to decommission all the weak cells that cannot satisfy the target data retention time under the worst-case operational conditions (e.g., the highest operating temperature). However, as DRAM scaling enters the sub-20nm regime, it becomes increasingly challenging to repair all the weak cells at reasonable cost. This work studies how one could use DRAM chips with unrepaired weak cells in computing systems. In particular, this work is based upon the simple idea that the OS reserves all the error-prone pages, i.e., pages that contain at least one unrepaired weak cell, so that they are never used. Under a relatively high error-prone page rate (e.g., 8%), this basic idea is subject to two issues: (1) Simply reserving all the error-prone pages could make it almost impossible for the OS to allocate a contiguous, fragmentation-free physical memory space for some critical operations such as OS booting and DMA buffering. (2) Since most error-prone pages may contain only a few unrepaired weak cells, excluding all the error-prone pages from practical usage could cause noticeable memory resource waste. Aiming to address these issues, this paper presents a controller-based selective page re-mapping strategy to ensure a contiguous critical memory region for the OS, and develops a software-based memory error tolerance scheme to re-cycle all the error-prone pages for the zRAM function in Linux. Since the first scheme only eliminates the fragmentation in the critical memory region (e.g., 128MB in Linux), the remaining non-critical memory region is still subject to severe fragmentation. Hence, we carried out experiments using SPEC CPU2006 to quantitatively demonstrate that a highly fragmented non-critical memory region may not cause significant computing system performance degradation.
We further study the latency and hardware cost of implementing the controller-based page re-mapping, and the effectiveness of re-cycling error-prone pages for zRAM in Linux. The experimental results show that our proposed software-based error tolerance scheme degrades the speed of zRAM by at most 7%.
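
To make the fragmentation issue and the re-mapping idea concrete, the following is a minimal illustrative sketch, not the paper's controller design. It assumes 4 KiB pages, error-prone pages placed independently at random, and a critical region that starts at physical page 0; the function names (`make_error_prone_map`, `longest_clean_run`, `build_remap_table`, `translate`) are hypothetical. Under these assumptions, a run of N consecutive pages is entirely clean with probability 0.92^N at an 8% error-prone page rate, which is why large clean runs are rare without re-mapping.

```python
import random

PAGE_SIZE = 4096  # bytes; assumed page size, not stated in the abstract


def make_error_prone_map(num_pages, rate, seed=0):
    """Return a list of booleans; True marks an error-prone page.

    Assumption: error-prone pages are independent and uniformly spread.
    """
    rng = random.Random(seed)
    return [rng.random() < rate for _ in range(num_pages)]


def longest_clean_run(error_prone):
    """Length (in pages) of the longest run of consecutive clean pages."""
    best = cur = 0
    for bad in error_prone:
        cur = 0 if bad else cur + 1
        best = max(best, cur)
    return best


def build_remap_table(error_prone, critical_pages):
    """Map each error-prone page in [0, critical_pages) to a distinct
    clean page outside the critical region.

    The spare pages consumed here would also have to be withheld from
    normal allocation (an assumption of this sketch).
    """
    spares = (p for p in range(critical_pages, len(error_prone))
              if not error_prone[p])
    table = {}
    for page in range(critical_pages):
        if error_prone[page]:
            try:
                table[page] = next(spares)
            except StopIteration:
                raise RuntimeError("not enough clean pages outside the critical region")
    return table


def translate(page, table):
    """Page number actually accessed after re-mapping."""
    return table.get(page, page)


if __name__ == "__main__":
    total_pages = (1 << 30) // PAGE_SIZE       # hypothetical 1 GiB of DRAM
    critical_pages = (128 << 20) // PAGE_SIZE  # 128MB critical region
    pages = make_error_prone_map(total_pages, rate=0.08)
    table = build_remap_table(pages, critical_pages)
    print("longest clean run without re-mapping (pages):", longest_clean_run(pages))
    print("critical-region pages re-mapped:", len(table))
```

After re-mapping, every page number in the critical region resolves to a clean page, so the OS sees that region as contiguous and error-free, while the rest of memory remains fragmented as the abstract describes.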