On Workload-Aware DRAM Failure Prediction in Large-Scale Data Centers

Wang, Xingyi; Yu, Li; Chen, Yiquan; Wang, Shiwen; Yin, Dulin; He, Cheng; Zhang, Yuzhong; Chen, Pinan; Li, Xin; Song, Weihong; Xu, Qiang; Jiang, Li

doi:10.1109/vts50974.2021.9441059

Cited by 7 publications

(1 citation statement)

References 11 publications

“…A data center is composed of large number of nodes, where each node consists of two sockets for the two CPUs. Each CPU has two integrated memory controllers (IMCs) that manage data in and out of memory within multiple channels [19]. An IMC is also called a memory chip controller (MCC) or memory controller unit (MCU).…”

Section: A Memory Structure (mentioning)

confidence: 99%

Review of Memory RAS for Data Centers

Lee, Kim, Kim et al. 2023

IEEE Access

Multi-bit error and downtime due to uncorrectable error (UE) in a dual in line memory module (DIMM) have received great attention in data centers for its high repair or replacement cost. These problems can be alleviated by utilizing ECC (Error Correction Code) technology, which enables prompt error correction during initial occurrences and prediction of future UEs based on recurring error patterns. The technologies for addressing errors can be categorized into reliability, availability, and serviceability (RAS), and need to be optimized using various parameters such as accuracy, recall, F-measures, and cost reduction. This paper describes an overview of the current RAS technologies and trends in memory for data centers, which includes an analysis of conventional ECC technologies and their recent developments. Once UEs cannot be completely eliminated with ECCs, page offline methods based on analysis on error patterns and characterization of UE can be performed. Recent research trends for reducing memory capacity wasted by UE and page offline have been towards on-die ECC in high bandwidth memory architecture.

INDEX TERMS: Correctable error (CE), error correction code (ECC), memory reliability, availability, serviceability (RAS), and uncorrectable error (UE).
