Abstract:

1. Summary

Recent studies estimate that servers account for as much as 57% of the total cost of ownership (TCO) of a datacenter [1]. One key contributor to this high server cost is the procurement of memory devices such as DRAM, especially for data-intensive datacenter cloud applications that need low latency (such as web search, in-memory caching, and graph traversal). Such memory devices, however, are prone to hardware errors caused by unintended bit flips during device operation [40, 33, 41, 20]. To protect against such errors, traditional systems uniformly employ devices with high-quality chips and error correction techniques, both of which increase device cost. At the same time, we observe that 1) data-intensive applications exhibit a diverse spectrum of tolerance to memory errors, and 2) traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost.

Our DSN-44 paper [30] is the first to 1) understand how tolerant different data-intensive applications are to memory errors and 2) design a new memory system organization that matches hardware reliability to application tolerance in order to reduce system cost. The main idea of our approach is to classify applications based on their memory error tolerance and to map them to heterogeneous-reliability memory system designs, managed cooperatively by hardware and software, to reduce system cost.

Our DSN-44 paper provides the following contributions:

1. A new methodology to quantify the tolerance of applications to memory errors. Our approach measures the effect of memory errors on application correctness and quantifies an application’s ability to mask or recover from memory errors.
2. A comprehensive characterization of the memory error tolerance of three data-intensive workloads: an interactive web search application [30, 39], an in-memory key-value store [30, 3], and a graph mining framework [30, 29]. We find that there exists an order of magnitude difference in memory error tolerance across these three applications.
3. An exploration of the design space of new memory system organizations, called heterogeneous-reliability memory, which combine a heterogeneous mix of reliability techniques that leverage application error tolerance to reduce system cost. We show that our techniques can reduce server hardware cost by 4.7%, while achieving 99.90% single server availability.
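The classify-and-map idea summarized above can be illustrated with a minimal sketch. It is not the mechanism evaluated in the paper: the tier names, relative costs, error-exposure values, region names, and tolerance scores below are all hypothetical placeholders chosen only to show how data with different error tolerance might be assigned to the cheapest memory tier that is still reliable enough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryTier:
    name: str
    relative_cost: float   # hypothetical cost per GB, normalized to the ECC tier
    error_exposure: float  # hypothetical fraction of raw errors that reach software


@dataclass(frozen=True)
class Region:
    name: str
    size_gb: float
    tolerance: float       # hypothetical: largest error exposure the data can absorb


def cheapest_tier(region, tiers):
    """Return the lowest-cost tier whose error exposure fits the region's tolerance."""
    eligible = [t for t in tiers if t.error_exposure <= region.tolerance]
    if not eligible:
        raise ValueError(f"no tier is reliable enough for {region.name}")
    return min(eligible, key=lambda t: t.relative_cost)


def plan(regions, tiers):
    """Map every region to a tier and return the mapping with its total relative cost."""
    mapping = {r.name: cheapest_tier(r, tiers) for r in regions}
    cost = sum(r.size_gb * mapping[r.name].relative_cost for r in regions)
    return mapping, cost


# All values below are illustrative assumptions, not measurements from the paper.
TIERS = [
    MemoryTier("ecc", relative_cost=1.0, error_exposure=0.0),
    MemoryTier("detect_only", relative_cost=0.9, error_exposure=0.5),
    MemoryTier("unprotected", relative_cost=0.8, error_exposure=1.0),
]

REGIONS = [
    Region("search_index", size_gb=4, tolerance=0.0),    # errors corrupt results
    Region("session_heap", size_gb=4, tolerance=0.5),    # some errors are masked
    Region("cached_values", size_gb=8, tolerance=1.0),   # can be refetched from storage
]

if __name__ == "__main__":
    mapping, cost = plan(REGIONS, TIERS)
    for region_name, tier in mapping.items():
        print(f"{region_name} -> {tier.name}")
    print(f"total relative cost: {cost:.2f}")
```

Choosing the cheapest eligible tier per region is a deliberately simple greedy policy. A real design would also have to account for availability targets and for software recovery mechanisms, which this sketch leaves out.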

Exploiting Memory Soft Redundancy for Joint Improvement of Error Tolerance and Access Efficiency

Exploiting Soft Redundancy for Error-Resilient On-Chip Memory Design

A Defect-Tolerant Memory Nanoarchitecture Exploiting Hybrid Redundancy

Hybrid Redundancy for Defect Tolerance in Molecular Crossbar Memory

Error-tolerance memory microarchitecture via dynamic multithreading redundancy

Joint Performance Improvement and Error Tolerance for Memory Design Based on Soft Indexing

Design of error-tolerant cache memory for multithreaded computing

Towards achieving reliable and high-performance nanocomputing via dynamic redundancy allocation

Dynamic redundancy allocation for reliable and high-performance nanocomputing

Reducing error accumulation effect in multithreaded memory systems

On the Use of DRAM with Unrepaired Weak Cells in Computing Systems

Improving Error Tolerance for Multithreaded Register Files

Implicit Programming: A Fast Programming Strategy for NAND Flash Memory Storage Systems Adopting Redundancy Methods

On the Characterization and Optimization of On-Chip Cache Reliability Against Soft Errors

Exploiting Asymmetry in eDRAM Errors for Redundancy-Free Error-Tolerant Design

Architecting High-Performance Energy-Efficient Soft Error Resilient Cache under 3D Integration Technology

Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance

Register Reallocation for Soft Error Reduction

Exploiting Narrow-Width Values for Improving Non-Volatile Cache Lifetime

Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost Via Heterogeneous-Reliability Memory

Realizing Unequal Error Correction for NAND Flash Memory at Minimal Read Latency Overhead