MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture 2021
DOI: 10.1145/3466752.3480061

HARP: Practically and Effectively Identifying Uncorrectable Errors in Memory Chips That Use On-Die Error-Correcting Codes

Abstract: Aggressive storage density scaling in modern main memories causes increasing error rates that are addressed using error-mitigation techniques. State-of-the-art techniques for addressing high error rates identify and repair bits that are at risk of error from within the memory controller. Unfortunately, modern main memory chips internally use on-die error correcting codes (on-die ECC) that obfuscate the memory controller's view of errors, complicating the process of identifying at-risk bits (i.e., error profili…
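
One way on-die ECC can hide which bits are at risk is miscorrection. The sketch below is illustrative only: it assumes a toy single-error-correcting Hamming(7,4) code, not the proprietary code inside any real chip, and the helper names (`encode`, `decode`, `observed_error_bits`) are hypothetical. It shows that two raw cell errors can make the decoder flip a third bit that never failed, so the errors seen outside the chip do not match the cells that actually failed.

```python
def encode(data4):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    code = [0] * 8  # index 0 unused so list indices match bit positions
    for pos, bit in zip((3, 5, 6, 7), data4):
        code[pos] = bit
    for p in (1, 2, 4):  # parity bits sit at power-of-two positions
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    return code[1:]


def syndrome(word7):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(word7, start=1):
        if bit:
            s ^= pos
    return s


def decode(word7):
    """Flip the bit the syndrome points at, then return the 4 data bits."""
    fixed = list(word7)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1  # correct only if exactly one bit was wrong
    return [fixed[i - 1] for i in (3, 5, 6, 7)]


def observed_error_bits(data4, raw_error_positions):
    """Data-bit indices that read back wrong after the given raw cell errors."""
    word = encode(data4)
    for pos in raw_error_positions:
        word[pos - 1] ^= 1
    read = decode(word)
    return [i for i, (a, b) in enumerate(zip(data4, read)) if a != b]


data = [1, 0, 1, 1]
print(observed_error_bits(data, [3]))     # one raw error: fully corrected
print(observed_error_bits(data, [3, 5]))  # two raw errors: a third bit is corrupted
```

In the second call the raw errors are at codeword positions 3 and 5, yet the data bit stored at position 6 also reads back wrong. A profiler that sees only post-correction data cannot tell which of the three cells are genuinely at risk.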

Citations: cited by 11 publications (12 citation statements)
References: 159 publications

“…Fourth, we test DRAM modules that do not implement error correction codes (ECC) [12,21,36,41,67,124]. Doing so ensures that neither on-die [46,104,113-115] nor rank-level [21,67] ECC can alter the RowHammer bit flips we observe and analyze. Fifth, we prevent known on-DRAM-die RowHammer defenses (i.e., TRR [52,55,84,93]) from working by not issuing refresh commands throughout our tests [27,71].…”
Section: Testing Methodology
Citation type: mentioning
confidence: 99%
“…15 and 16 on spatial variation of HC_first across subarrays can be leveraged to reduce the time required to profile a given DRAM module's RowHammer vulnerability characteristics. This is an important challenge because profiling a DRAM module's RowHammer characteristics requires analyzing several environmental conditions and attack properties (e.g., data pattern, access pattern, and temperature), requiring time-consuming tests that lead to long profiling times [20,27,71,72,78,110,111,113,166]. According to our Obsvs.…”
Section: Potential Defense Improvements
Citation type: mentioning
confidence: 99%
“…To exacerbate the problem of identifying a definitive error model, DRAM manufacturers are starting to incorporate two on-die error-mitigation mechanisms that correct a limited number of errors from within the DRAM chip itself: (1) on-die ECC [28,54,95,254-258] for improving reliability and yield and (2) target row refresh [100,160,222,239] for partially mitigating the RowHammer vulnerability. Prior works on ECC [27,30,54,95,101,258,259,296,320-324] and RowHammer [92,100,160,226] show that both on-die ECC and TRR change how errors appear outside of the DRAM chip, thereby changing the DRAM error model seen by the memory controller (and therefore, to the rest of the system). Unfortunately, both mechanisms are opaque to the memory controller and are considered trade secrets that DRAM manufacturers will not officially disclose [22,23,92,93,95,226,258,298].…”
Section: Lack Of Transparency In Commodity DRAM
Citation type: mentioning
confidence: 99%
“…Prior works propose two practical ways of identifying retention-weak cells: (1) active profiling, which uses comprehensive tests to search for error-prone cells offline [77-79, 127, 129, 135], and (2) reactive profiling, which constantly monitors memory to identify errors as they manifest during runtime, e.g., ECC scrubbing [56,61,82]. Both approaches require the profiler to understand the worst-case behavior of data-retention errors for a given DRAM chip [79,127]: an active profiler must use the worst-case conditions to maximize the proportion of retention-weak cells it identifies during profiling [78] and a reactive profiler must be provisioned to identify (and possibly also mitigate) the worst-case error pattern(s) that might be observed at runtime, e.g., to choose an appropriate ECC detection and correction capability [127,226,324].…”
Section: Lack Of Transparency In Commodity DRAM
Citation type: mentioning
confidence: 99%
“…Further, we find that 1) over 99.9% of the DRAM rows are vulnerable (i.e., have at least one bit flip) to the new access patterns and 2) the new access patterns cause up to 9.4 million bit flips per DRAM bank. The large number of RowHammer bit flips caused by our specialized access patterns has significant implications for systems protected by Error Correction Codes (ECC) [47,92,93,95]. Our analysis shows that the U-TRR-discovered access patterns can cause up to 7 bit flips at arbitrary locations in one 8-byte dataword, suggesting that typical ECC schemes capable of correcting one error/symbol and detecting two errors/symbols (e.g., SECDED ECC [10,37,43,60,61,79,87,118] and Chipkill [2,20,86]) cannot provide sufficient protection against RowHammer even in the presence of TRR mechanisms.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
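
The last statement's claim that SECDED cannot cope with many bit flips in one 8-byte dataword can be illustrated with a minimal sketch. It assumes a textbook (72,64) extended Hamming code, which is one common way to build SECDED and not necessarily what any given system uses; the functions `encode`, `decode`, and `flip` and the chosen flip positions are hypothetical. One flip is corrected, two are detected, but three or seven flips can look like a single correctable error, so the decoder silently returns wrong data.

```python
def encode(data_bits):
    """Extended Hamming encode; index 0 of the result is the overall parity bit."""
    k = len(data_bits)
    r = 0
    while (1 << r) < k + r + 1:  # smallest r with 2^r >= k + r + 1
        r += 1
    n = k + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two: a data position
            code[pos] = next(it)
    syn = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syn ^= pos
    for b in range(r):  # set parity bits so the syndrome becomes zero
        code[1 << b] = (syn >> b) & 1
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, data) where status is 'clean', 'corrected', or 'detected'."""
    n = len(code) - 1
    syn = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syn ^= pos
    overall = sum(code) % 2
    fixed = list(code)
    if syn == 0 and overall == 0:
        status = "clean"
    elif overall == 1 and syn <= n:
        status = "corrected"  # decoder assumes exactly one bit flipped
        fixed[syn] ^= 1       # syn == 0 means the overall parity bit itself
    else:
        status = "detected"   # even number of flips, or syndrome out of range
    data = [fixed[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


data = [i % 2 for i in range(64)]  # one 8-byte dataword
code = encode(data)
for flips in ([5], [5, 9], [3, 5, 9], [3, 5, 6, 7, 9, 10, 11]):
    status, out = decode(flip(code, flips))
    wrong = sum(a != b for a, b in zip(data, out))
    print(len(flips), "flips ->", status, "| wrong data bits:", wrong)
```

For the three-flip and seven-flip cases the decoder reports "corrected" while the returned data contains more wrong bits than were originally flipped, which is the failure mode the quoted statement warns about.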