2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
DOI: 10.1109/dsn.2015.57

Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field

Abstract: Computing systems use dynamic random-access memory (DRAM) as main memory. As prior works have shown, failures in DRAM devices are an important source of errors in modern servers. To reduce the effects of memory errors, error correcting codes (ECC) have been developed to help detect and correct errors when they occur. In order to develop effective techniques, including new ECC mechanisms, to combat memory errors, it is important to understand the memory reliability trends in modern systems. In this pape…
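The abstract refers to ECC as the mechanism servers use to detect and correct DRAM errors. As a rough illustration only, and not the specific ECC scheme used in the servers studied in the paper, the sketch below implements a Hamming(7,4) code with an added overall parity bit, the textbook single-error-correct, double-error-detect (SECDED) construction; production ECC DIMMs commonly use wider SECDED or Chipkill-style codes over 64-bit or 128-bit words.

```python
# Illustrative sketch only: Hamming(7,4) plus an overall parity bit (SECDED-style),
# showing how ECC corrects any single-bit error and detects double-bit errors.

def encode(data4):
    """data4: list of 4 bits -> 8-bit codeword [p1, p2, d1, p3, d2, d3, d4, p_all]."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers codeword positions 4, 5, 6, 7
    word = [p1, p2, d1, p3, d2, d3, d4]
    return word + [sum(word) % 2]   # overall parity enables double-error detection

def decode(word8):
    """Return (data4, status) where status is 'ok', 'corrected', or 'double-error'."""
    w = list(word8)
    c = w[:7]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3          # 1-based position of a single flipped bit
    parity_ok = (sum(w) % 2) == 0
    if syndrome == 0 and parity_ok:
        status = 'ok'
    elif syndrome != 0 and not parity_ok:
        c[syndrome - 1] ^= 1                 # single error: flip it back
        status = 'corrected'
    elif syndrome == 0 and not parity_ok:
        status = 'corrected'                 # error hit the overall parity bit itself
    else:
        status = 'double-error'              # detectable but uncorrectable
    return [c[2], c[4], c[5], c[6]], status

# Usage: inject a single-bit fault and recover the original data.
cw = encode([1, 0, 1, 1])
cw[5] ^= 1                                   # flip one bit of the stored codeword
print(decode(cw))                            # -> ([1, 0, 1, 1], 'corrected')
```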

Cited by 172 publications (157 citation statements)
References 28 publications
“…Although some of these studies utilize workload-inherent features, such as the memory reuse time, they are orthogonal to our work. Predictive maintenance and statistical prediction of errors: Considerable research has been done on statistical prediction of different types of hardware faults, including DRAM errors, in supercomputers [4], [18], [35], [38], [44], [58], [66], [89]. The majority of these studies proposed different techniques, based either on rules [38] or Machine Learning [66], for prediction of failures that may happen in various hardware components using history of errors.…”
Section: Related Work
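As a purely illustrative aside (an assumption about what such a technique might look like, not the actual method of [38] or [66]), a rule-based predictor of the kind described in the statement above can be as simple as a sliding-window threshold on each DIMM's corrected-error history; the field names and threshold value below are hypothetical.

```python
# Hypothetical rule-based predictor: flag a DIMM when its corrected-error
# count within any 7-day window exceeds a fixed threshold.
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)
THRESHOLD = 100  # corrected errors per DIMM per window (illustrative value)

def flag_failing_dimms(error_log):
    """error_log: iterable of (timestamp, dimm_id) corrected-error records."""
    history = defaultdict(list)
    for ts, dimm in error_log:
        history[dimm].append(ts)
    flagged = set()
    for dimm, stamps in history.items():
        stamps.sort()
        left = 0
        for right, ts in enumerate(stamps):      # two-pointer sweep over the window
            while ts - stamps[left] > WINDOW:
                left += 1
            if right - left + 1 >= THRESHOLD:
                flagged.add(dimm)
                break
    return flagged

# Example: a DIMM that logs 150 corrected errors within one day gets flagged.
start = datetime(2015, 1, 1)
log = [(start + timedelta(minutes=i), "dimm_42") for i in range(150)]
print(flag_failing_dimms(log))                   # -> {'dimm_42'}
```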
“…Moreover, even though they tried to consider other workload/architecture related factors, this was limited due to the constrained access to only specific features, like percentage of utilized memory, average CPU utilization and hardware characteristics [44]. The joint consideration of more features may reveal new non-linear behaviors that cannot be captured by linear regression models [44] or traditional workload-agnostic statistical models [31]. In addition, all these studies lacked an adequate number of samples because of the rare manifestation of errors for DRAM operating under nominal circuit parameters, which may result in contradictory observations [44], [67].…”
Section: Introduction
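To make the preceding statement concrete, the following minimal sketch (purely illustrative, not taken from [44] or [31]; feature names and data are synthetic) contrasts a linear regression model with a non-linear learner on a target that contains an interaction between features. The non-linear model captures the joint effect that the linear model cannot represent.

```python
# Illustrative comparison: linear vs. non-linear model on synthetic
# DRAM error-count data with an interaction (non-linear) term.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 5000
# Hypothetical server-level features: memory utilization, CPU utilization, chip density.
mem_util = rng.uniform(0.0, 1.0, n)
cpu_util = rng.uniform(0.0, 1.0, n)
density_gb = rng.choice([1, 2, 4], n)

# Synthetic target with a mem_util x density interaction that a purely
# linear model cannot capture.
errors = 0.5 * mem_util + 2.0 * mem_util * density_gb + rng.normal(0, 0.1, n)

X = np.column_stack([mem_util, cpu_util, density_gb])
X_tr, X_te, y_tr, y_te = train_test_split(X, errors, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, mean_absolute_error(y_te, model.predict(X_te)))
```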
“…As a main memory of large-scale computing systems, dynamic random access memories (DRAMs) have been widely used, and have shown remarkable reliability from the application point of view, though mostly owing to the implementation of redundancy and error correction coding (ECC). In modern systems employing DRAMs, however, a growing memory error rate is observed, which depends on the device count; in addition, a significant number of faults is observed in the memory controllers and transmission channels [20]. The aforementioned types of faults may affect the neural network itself or the memory in which parameters are temporarily stored.…”
Section: Fault Tolerance Analysis
“…For empirical work and motivation of consideration of errors and other tail events in large-scale infrastructure, cf. Tiwari, Gupta, Gallarno, Rogers, and Maxwell [106], Di Martino, Kalbarczyk, Iyer, Baccanico, Fullop, and Kramer [79], Schroeder, Pinheiro, and Weber [99,100], Meza, Wu, Kumar, and Mutlu [83], Herault and Robert [62], and Barroso, Clidaras, and Hölzle [17].…”
Section: The Case For Tolerance Against Errors