Improving DRAM Reliability Using a High Order Error Correction Code

Wei Li, Meng Zhang, Tianwei Gui, Zheng Fang, Changsheng Xie, Fei Wu
DOI: https://doi.org/10.1109/tcad.2024.3400677
IF: 2.9
2024-01-01
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract: Dynamic random access memory (DRAM) is being upgraded iteratively, and its transmission rate and bandwidth are rising quickly as a result. At the same time, advances in the DRAM process have shrunk the storage cell and increased cell integration within each device, significantly boosting storage capacity and density. Because of these benefits, DRAM is widely used as a crucial storage component in personal computers, mobile devices, servers, and data centers. However, DRAM is vulnerable to single-bit, row, and column errors, which severely undermine data reliability: they cause data loss and corruption and can lead to system crashes and downtime. DRAM uses error correction codes (ECC) to protect data and increase reliability, but large-capacity DRAM is more prone to cross-chip multi-bit errors, and traditional error correction strategies cannot handle them. Designing an ECC strategy with robust error correction capability is therefore a crucial problem. To address the reliability issues caused by cross-chip multi-bit errors in DRAM, a high order ECC scheme with stronger error correcting capability is developed at a higher firmware layer, without changing the hardware architecture. When rank-level ECC (RECC) discovers an uncorrectable error, the high order ECC technique is invoked to provide stronger error correction capability while minimizing the latency overhead. The proposed high order ECC algorithm is evaluated and verified through simulation experiments in terms of both error correction capability and encoding/decoding latency. Simulation results show that, compared with existing ECC schemes, the proposed high order ECC scheme for DRAM reduces latency by 69% and storage overhead by 5.56%. The proposed high order ECC method has significant research implications and is useful in preventing data loss and enhancing DRAM reliability.