Improving Memory Reliability by Bounding DRAM Faults

Criss, Kjersten; Bains, Kuljit; Agarwal, Rajat; Bennett, Tanj; Grunzke, Terry; Kim, Jangryul Keith; Chung, Hoeju; Jang, Munseon

doi:10.1145/3422575.3422803

Cited by 15 publications

(14 citation statements)

References 0 publications


“…To combat DRAM-related failures, system designers typically incorporate reliability, availability and serviceability (RAS) features [153-155] that collectively improve system reliability beyond what commodity DRAM chips can provide alone. In general, memory RAS is a broad research area with solutions spanning the hardware-software stack, ranging from hardware-based mechanisms within the DRAM chip (e.g., on-die ECC scrubbing [11,101,156], post-package repair [10,11,157-159], target row refresh [100,160]), memory controller (e.g., rank-level ECC [48-55, 57-60, 81], rank-level ECC scrubbing [56, 61, 62, 62-65, 82, 156, 161], repair techniques [22,79,162-169]) to software-only solutions (e.g., page retirement [76,120-124], failure prediction [170-175]).…”

Section: Benefits for DRAM Consumers (mentioning)

confidence: 99%

“…In Step 2, we propose extending DRAM standards with explicit DRAM reliability standards that provide industry-standard guarantees, tools, and/or information helpful to consumers. We envision different possibilities for these reliability standards, including (1) reliability guarantees for how a chip is expected to behave under certain operating conditions (e.g., predictable behavior of faults [101]); (2) disclosure of industry-validated DRAM reliability models and testing strategies suitable for commodity DRAM chips (e.g., similar to how JEDEC JEP122 [102], JESD218 [103], and JESD219 [104] address Flash-memory-specific error mechanisms [105-107] such as floating-gate data retention [108-111] and models for physical phenomena such as threshold voltage distributions [112-115]); and (3) requirements for manufacturers to directly provide relevant information about their DRAM chips (e.g., the information requested in Step 1). As the DRAM industry continues to evolve, we anticipate closer collaboration between DRAM and system designers to efficiently overcome the technology scaling challenges that DRAM is already facing [26,28,116,117].…”

Section: Introduction (mentioning)

confidence: 99%


A Case for Transparent Reliability in DRAM Systems

Patel, Shahroodi, Manglik, et al. 2022. Preprint.


Mass-produced commodity DRAM is the preferred choice of main memory for a broad range of computing systems due to its favorable cost-per-bit. However, today's systems have diverse system-specific needs (e.g., performance, energy, reliability) that are difficult to address using one-size-fits-all general-purpose DRAM. Unfortunately, although system designers can theoretically adapt commodity DRAM chips to meet their particular design goals (e.g., by exploiting slack in access timings to improve performance, or implementing system-level RowHammer mitigations), we observe that designers today lack the necessary insight into commodity DRAM chips' reliability characteristics to implement these techniques in practice. In this work, we make a case for DRAM manufacturers to provide increased transparency into simple device characteristics (e.g., internal row address mapping, cell array organization) that affect consumer-visible reliability. Doing so has negligible impact on manufacturers given that these characteristics can be reverse-engineered using known techniques; however, it has significant benefit for system designers, who can then make informed decisions to better adapt commodity DRAM to meet modern systems' needs while preserving its cost advantages. To support our argument, we study four ways that system designers can adapt commodity DRAM chips to system-specific design goals: (1) improving DRAM reliability; (2) reducing DRAM refresh overheads; (3) reducing DRAM access latency; and (4) defending against RowHammer attacks. We observe that adopting solutions for any of the four goals requires system designers to make assumptions about a DRAM chip's reliability characteristics. These assumptions discourage system designers from using such solutions in practice due to the difficulty of both making and relying upon the assumption. We identify DRAM standards as the root of the problem: current standards rigidly enforce a fixed operating point with no specifications for how a system designer might explore alternative operating points. To overcome this problem, we introduce a two-step approach that reevaluates DRAM standards with a focus on transparency of reliability characteristics so that system designers are encouraged to make the most of commodity DRAM technology for both current and future DRAM chips.


“…On-die ECC addresses uncorrelated single-bit errors that limit a manufacturer's factory yield [21,43,60,74,121,145,146,162] and is already prevalent among commodity DRAM chips today. Therefore, it is imperative that system-level error-mitigation mechanisms take on-die ECC into account, as clearly motivated by several prior works [21,32,43,69,137,145,162]. [18-20, 113, 114], main memory is generally designed separately from the memory controller [130].…”

Section: Addressing Scaling-related Errors (mentioning)

confidence: 99%

“…Unfortunately, this separation discourages building a unified error-mitigation mechanism across the memory and its controller. This is exemplified by the widespread use of proprietary DRAM on-die ECC, which introduces new reliability challenges for designing error mitigation mechanisms within the DRAM controller [21,32,43,137,145,162]. In general, the standardized interface between the memory and the controller (e.g., JEDEC DRAM standards [64,67,68]) must be modified to develop a joint solution, which impacts all manufacturers and consumers involved, and thus is a laborious and long (and often politically-charged) process.…”

Section: Addressing Scaling-related Errors (mentioning)

confidence: 99%

“…Unfortunately, on-die ECC changes how memory errors appear outside the memory chip (e.g., to the memory controller or the system software). This introduces new challenges for designing additional error-mitigation mechanisms at the system level [21,32,43,69,115,137,142,162] or testing a memory chip's reliability characteristics [41,44,145,146].…”

Section: Introduction (mentioning)

confidence: 99%


HARP: Practically and Effectively Identifying Uncorrectable Errors in Memory Chips That Use On-Die Error-Correcting Codes

Patel, Oliveira, Mutlu. 2021. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture.


Aggressive storage density scaling in modern main memories causes increasing error rates that are addressed using error-mitigation techniques. State-of-the-art techniques for addressing high error rates identify and repair bits that are at risk of error from within the memory controller. Unfortunately, modern main memory chips internally use on-die error correcting codes (on-die ECC) that obfuscate the memory controller's view of errors, complicating the process of identifying at-risk bits (i.e., error profiling). To understand the problems that on-die ECC causes for error profiling, we analytically study how on-die ECC changes the way that memory errors appear outside of the memory chip (e.g., to the memory controller). We show that on-die ECC introduces statistical dependence between errors in different bit positions, raising three key challenges for practical and effective error profiling: on-die ECC (1) exponentially increases the number of at-risk bits the profiler must identify; (2) makes individual at-risk bits more difficult to identify; and (3) interferes with commonly-used memory data patterns that are designed to make at-risk bits easier to identify. To address the three challenges, we introduce Hybrid Active-Reactive Profiling (HARP), a new error profiling algorithm that rapidly achieves full coverage of at-risk bits based on two key insights. First, errors that on-die ECC fails to correct have two sources: (1) direct errors from raw bit errors in the data portion of the ECC word and (2) indirect errors that on-die ECC introduces when facing uncorrectable errors. Second, the maximum number of indirect errors that can occur concurrently is limited to the correction capability of on-die ECC. HARP's key idea is to first identify all bits at risk of direct errors using existing profiling techniques with the help of small modifications to the on-die ECC mechanism. Then, a secondary ECC within the memory controller with correction capability equal to or greater than that of on-die ECC can safely identify bits at-risk of indirect errors, if and when they fail. We evaluate HARP in simulation relative to two state-of-the-art baseline error profiling algorithms. We show that HARP achieves full coverage of all at-risk bits faster (e.g., 99th-percentile coverage 20.6%/36.4%/52.9%/62.1% faster, on average, given 2/3/4/5 raw bit errors per ECC word) than the baseline algorithms, which sometimes fail to achieve full coverage. We perform a case study of how each


New Design of Error Control Codes Resilient to Single Burst Error or Two Random Bit Errors Using Constacyclic Codes

Kim. 2022. IEEE Access.


In this paper, we introduce a new design method for burst error control codes (BECCs) that can correct a single burst error or two random bit errors using the maximum likelihood syndrome decoder (MLSD). The proposed BECCs are designed by modifying the well-known Fire codes using constacyclic codes. Also, for the existing low-latency burst error-correcting decoder, it is shown that the proposed BECCs have the single burst error correction capability together with two random bit error detection capability, with no additional parity bits and with complexity comparable to that of the existing BECCs. For this, the complexity and latency are numerically analyzed by register-transfer level (RTL) synthesis. INDEX TERMS: Burst error control codes (BECCs), burst errors, error control codes, Fire codes, low-latency decoder
