Towards Error Correction for Computing in Racetrack Memory

Preston Brazzle, Benjamin F. Morris III, Evan McKinney, Peipei Zhou, Jingtong Hu, Asif Ali Khan, Alex K. Jones
2024-07-31
Abstract: Computing-in-memory (CIM) promises to alleviate the Von Neumann bottleneck and accelerate data-intensive applications. Depending on the underlying technology and configuration, CIM enables implementing compute primitives in place, such as multiplication, search operations, and bulk bitwise logic operations. Emerging nonvolatile memory technologies such as spintronic Racetrack memory (RTM) promise not only unprecedented density but also significant parallelism through CIM. However, most CIM designs, including those based on RTM, exhibit high fault rates. Existing error correction codes (ECC) are not homomorphic over bitwise operations such as AND and OR, and hence cannot protect against CIM faults. This paper proposes CIRM-ECC, a technique to protect spintronic RTMs against CIM faults. At the core of CIRM-ECC, we use a recently proposed RTM-based CIM approach and leverage its peripheral circuitry to implement our novel ECC codes. We show that CIRM-ECC can be applied to single-bit Hamming codes as well as multi-bit BCH codes.
Hardware Architecture
What problem does this paper attempt to address?
The problem this paper addresses is: **how to implement effective error correction in a computing-in-memory (CIM) system based on Racetrack memory (RTM), so that high fault rates do not corrupt computational results**. Specifically, traditional error correction codes (ECC) cannot protect bitwise operations such as AND and OR in CIM, because the codes are not homomorphic over these operations: the AND or OR of two codewords is generally not the codeword of the AND or OR of the underlying data. In addition, existing redundancy methods (such as n-modular redundancy) can improve reliability, but significantly reduce parallelism and performance.

### Main problems:

1. **High fault rate**: CIM designs based on RTM usually exhibit high fault rates, which affect the accuracy of computational results.
2. **Traditional ECC not applicable**: Traditional ECC (such as Hamming and BCH codes) cannot protect bitwise operations such as AND and OR, because the codes are not homomorphic over those operations.
3. **Limitations of redundancy methods**: Redundancy methods (such as n-modular redundancy) can improve reliability, but greatly reduce the parallelism and performance of the system.

### Solutions:

The paper proposes a new error-correction technique, **CIRM-ECC** (Computing In Racetrack Memory - Error Correction Coding), which aims to protect RTM-based CIM operations from faults. The core idea of CIRM-ECC is to use the homomorphic property of XOR operations to detect and correct errors in other logical operations (such as AND and OR).

### Key points:

- **Homomorphic property of XOR**: Many ECC schemes (such as Hamming and BCH codes) are homomorphic with respect to XOR, so the XOR of two codewords is itself a valid codeword and errors in it can be detected and corrected.
- **TR operation**: CIRM-ECC uses the Transverse Read (TR) operation in RTM to perform bitwise operations and uses XOR operations to detect and correct faults.
- **Single-level fault detection**: CIRM-ECC focuses on detecting and correcting single-level faults, because these faults are the most common and can be effectively detected by XOR operations.

Through this method, CIRM-ECC can significantly reduce the uncorrectable failure rate while maintaining high performance, thereby providing more reliable computing for RTM-based CIM systems.
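
The XOR homomorphism that the summary relies on can be checked exhaustively on a small code. The sketch below is not the CIRM-ECC scheme itself: it assumes a textbook Hamming(7,4) code, and the helper names (`hamming74_encode`, `syndrome`, `bitwise`) are hypothetical and chosen only for illustration. It shows that XOR of two codewords always equals the codeword of the XORed data, while AND does not have this property.

```python
from itertools import product
from operator import and_, xor


def hamming74_encode(data):
    """Encode 4 data bits as a Hamming(7,4) codeword (p1, p2, d1, p3, d2, d3, d4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return (p1, p2, d1, p3, d2, d3, d4)


def syndrome(word):
    """Return 0 for a valid codeword, else the 1-based position of a single flipped bit."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s3 = word[3] ^ word[4] ^ word[5] ^ word[6]
    return s1 | (s2 << 1) | (s3 << 2)


def bitwise(op, x, y):
    """Apply a two-input bit operation element-wise to two equal-length bit tuples."""
    return tuple(op(a, b) for a, b in zip(x, y))


data_words = list(product((0, 1), repeat=4))

# XOR: encoding commutes with the operation for every pair of data words.
xor_is_homomorphic = all(
    bitwise(xor, hamming74_encode(a), hamming74_encode(b))
    == hamming74_encode(bitwise(xor, a, b))
    for a in data_words
    for b in data_words
)

# AND: collect the pairs where encoding does NOT commute with the operation.
and_counterexamples = [
    (a, b)
    for a in data_words
    for b in data_words
    if bitwise(and_, hamming74_encode(a), hamming74_encode(b))
    != hamming74_encode(bitwise(and_, a, b))
]

# A single-bit fault injected into an XOR result is located by the syndrome.
a, b = (1, 0, 0, 0), (0, 1, 0, 0)
faulty = list(bitwise(xor, hamming74_encode(a), hamming74_encode(b)))
faulty[4] ^= 1  # flip codeword position 5 (index 4)
fault_position = syndrome(faulty)
```

For this code, `xor_is_homomorphic` evaluates to `True`, and the pair `a`, `b` above is one of the AND counterexamples: the AND of their codewords has a nonzero syndrome, so a standard decoder would treat a fault-free AND result as corrupted. This is the gap that motivates checking AND and OR results through XOR rather than decoding them directly.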