Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis 2009
DOI: 10.1145/1654059.1654117

Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

Abstract: The scalability of future massively parallel processing (MPP) systems is being severely challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in overhead of 25% or more at the petascale. With a direct correlation between checkpoint frequencies and node counts, novel techniques that can take more frequent checkpoints with minimum overhead are critical to implement a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technolo…

Cited by 124 publications (99 citation statements)
References: 22 publications
“…In addition, to avoid excessively delaying LLC misses due to row conflicts while migrating, the PCM DIMMs are equipped with an extra pair of row-buffers per rank, used exclusively for migrations. Operated by the MC, these buffers communicate with the internal prefetching circuitry of the PCM DIMM [11,12], bypassing the original bank's row buffer. Since our migrations occur in sequence, two of these buffers are necessary only when the migration involves two banks of the same rank, and one buffer would suffice otherwise.…”
Section: Rank-based Page Placement (mentioning)
confidence: 99%
“…Recent studies [6], [7] estimate that the annual increase in memory size and network bandwidth is 41% and 26%, respectively. Figure 1 shows the trends in both memory size and network bandwidth for the period between 2008 and 2020.…”
Section: Motivation (mentioning)
confidence: 99%
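
The two growth rates quoted in the statement above can be compounded to see how quickly memory size pulls ahead of network bandwidth. The following is a minimal sketch, assuming constant annual compounding and a normalized 2008 baseline of 1.0 for both quantities; the function and constant names are hypothetical, and the cited figure itself is not reproduced here.

```python
# Sketch only: assumes the quoted rates compound once per year and stay constant.
# Both quantities are normalized to 1.0 in 2008 (an assumption, not source data).

def growth_factor(annual_rate: float, years: int) -> float:
    """Return the cumulative growth multiple after `years` of compounding."""
    return (1.0 + annual_rate) ** years


MEMORY_RATE = 0.41    # annual increase in memory size, from the quoted statement
NETWORK_RATE = 0.26   # annual increase in network bandwidth, from the quoted statement
YEARS = 2020 - 2008   # period named in the quoted statement

memory_growth = growth_factor(MEMORY_RATE, YEARS)
network_growth = growth_factor(NETWORK_RATE, YEARS)

# Ratio of memory growth to bandwidth growth: how much more data there is
# to move per unit of network bandwidth at the end of the period.
gap = memory_growth / network_growth

print(f"memory grows  {memory_growth:.1f}x")
print(f"network grows {network_growth:.1f}x")
print(f"memory-to-bandwidth gap widens {gap:.2f}x")
```

Under these assumptions the gap widens every year by the ratio of the two growth factors, which is the trend the citing paper uses as motivation.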
“…Our approach differs from theirs in that we provide techniques to reduce the interference of checkpointing for distributed-memory clusters. Dong et al. leverage PCRAM [7] for checkpointing and propose the hybrid local/global checkpointing mechanism. Their approach can be incorporated with the semi-blocking algorithm by relaxing the stall of computation when taking a global checkpoint.…”
Section: Related Work (mentioning)
confidence: 99%
“…We then compare the efficiency of Euripus to three systems: 1) one that creates undo-log checkpoints every 10 ms but redo logs every 1 hour (UndoLog+RL1h), 2) another that creates only Euripus's redo-log checkpoints (RedoLog), and 3) one that only creates redo logs every 1 hour (RedoLog 1h). We assume that all redo logs and a full checkpoint are stored in PCM, and obtain checkpoint-restore times from PCM through simulation (1 s for a full checkpoint, 1.5 s for a minutes-level checkpoint, and 1.75 s for a seconds-level checkpoint; see footnote 5). The base error rate of the system was estimated to be $10^{-8}$ from field data [13,20].…”
Section: Error Recovery (mentioning)
confidence: 99%
“…Creating checkpoints infrequently, e.g. every 1 hour, has the worst efficiency, because the error frequency is higher than the checkpointing one, and the system cannot effectively recover from an error.…”
Footnote 5: Note that rollback to an incremental redo log starts by restoring the previous full checkpoint.
Footnote 6: The error rate $r_i$ at level $i$ is $r_i = \alpha \cdot r_{i-1}$, where $\alpha \le 1$ and $r_{\text{total}} = \sum_{i=0}^{l} r_i = \sum_{i=0}^{l} r\,\alpha^{i}$.
Section: Error Recovery (mentioning)
confidence: 99%
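
The per-level error-rate model in the footnote to the statement above is a geometric series, so the total can be evaluated either term by term or in closed form. The following is a minimal sketch, assuming the base rate r is the level-0 rate; the function names and the values chosen for alpha and the number of levels are hypothetical, and only the base rate of 10^-8 comes from the quoted text. The lower bound of 0 on alpha is also an assumption, since the source states only that alpha is at most 1.

```python
# Sketch only: evaluates r_total = sum_{i=0}^{l} r * alpha**i from the quoted footnote.
# ALPHA and LEVELS below are illustrative values, not taken from the cited paper.

def total_error_rate(base_rate: float, alpha: float, levels: int) -> float:
    """Sum the per-level rates r_i = alpha * r_(i-1) for i = 0..levels, with r_0 = base_rate."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] (lower bound is an assumption)")
    if levels < 0:
        raise ValueError("levels must be non-negative")
    return sum(base_rate * alpha ** i for i in range(levels + 1))


def total_error_rate_closed_form(base_rate: float, alpha: float, levels: int) -> float:
    """Closed form of the same geometric series."""
    if alpha == 1.0:
        # Every level has the same rate, so the sum is (levels + 1) equal terms.
        return base_rate * (levels + 1)
    return base_rate * (1.0 - alpha ** (levels + 1)) / (1.0 - alpha)


BASE_RATE = 1e-8   # base error rate quoted in the citation statement
ALPHA = 0.5        # hypothetical decay factor between levels
LEVELS = 2         # hypothetical index l of the last level

total = total_error_rate(BASE_RATE, ALPHA, LEVELS)
print(f"total error rate: {total:.3e}")
```

With alpha below 1 the total is bounded by r / (1 - alpha) however many levels are added, so the base rate dominates the sum.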