2010 39th International Conference on Parallel Processing
DOI: 10.1109/icpp.2010.80
Optimizing HPC Fault-Tolerant Environment: An Analytical Approach

Cited by 37 publications (30 citation statements)
References 22 publications
“…However, it is worth noting that such a result would be impractical to use in the HPC context, given the huge number of processors involved in the computation and the fact that the computational complexity of the algorithm provided by Plank et al. is exponential in the number of processors. In [19], Jin et al. provided an analytical model for the expected completion time in the presence of failures as a function of the number of processors involved in the computation and the amount of workload to execute. Using queuing theory and applying a Taylor expansion to the expression for the expected completion time with respect to the number of processors, they derived a first-order approximation to the optimal checkpoint interval, assuming a constant checkpoint cost.…”
Section: Related Work
confidence: 99%
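The excerpt above describes a first-order approximation of the optimal checkpoint interval under a constant checkpoint cost. As a rough illustration only, the minimal sketch below evaluates the classic first-order closed form tau ≈ sqrt(2·C·M) (Young's result); whether Jin et al.'s Taylor-expansion result reduces to exactly this form is an assumption, and all parameter values are hypothetical.

/* Minimal sketch: first-order optimal checkpoint interval under a
 * constant checkpoint cost. The closed form sqrt(2*C*M) is the classic
 * first-order (Young) estimate; parameter values are hypothetical. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double checkpoint_cost = 60.0;                    /* C: seconds per checkpoint (hypothetical) */
    double node_mtbf       = 5.0 * 365 * 24 * 3600;   /* per-node MTBF: 5 years (hypothetical)     */
    double num_procs       = 100000.0;                /* processors in the job (hypothetical)      */

    /* System-level MTBF shrinks roughly linearly with the processor count. */
    double system_mtbf = node_mtbf / num_procs;

    /* First-order optimal checkpoint interval. */
    double tau_opt = sqrt(2.0 * checkpoint_cost * system_mtbf);

    printf("system MTBF: %.0f s, optimal checkpoint interval: %.0f s\n",
           system_mtbf, tau_opt);
    return 0;
}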
“…Balaprakash et al. [31] address the question of optimizing checkpoint intervals while also considering the energy consumption of a multilevel scheme. As with the models for coordinated schemes [26], [27] and [28], these hierarchical models cannot be used for non-hierarchical, unified uncoordinated-and-coordinated checkpointing systems such as combined task-level and system-wide checkpointing. The unified model proposed by Bosilca et al. [32] covers a range of checkpointing systems, from a fully coordinated scheme at one extreme to partially coordinated hierarchical schemes at the other.…”
Section: Related Work
confidence: 97%
“…Young and Daly study sequential jobs [12], [13]. For parallel jobs, studies such as [26], [27] and [28] model coordinated checkpointing protocols.…”
Section: Related Work
confidence: 99%
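For reference, the commonly cited closed forms for the optimal checkpoint interval of a single job with checkpoint cost $C$ and mean time between failures $M$ are Young's first-order estimate and Daly's higher-order refinement; the excerpt does not quote them, so they are reproduced here from those papers' standard statements:

$$\tau_{\mathrm{Young}} = \sqrt{2CM}, \qquad
\tau_{\mathrm{Daly}} = \sqrt{2CM}\left[1 + \tfrac{1}{3}\sqrt{\tfrac{C}{2M}} + \tfrac{1}{9}\,\tfrac{C}{2M}\right] - C \quad (C < 2M),$$

with $\tau_{\mathrm{Daly}} = M$ when $C \ge 2M$.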
“…With the data distribution, I/O interconnect usage, and application access patterns provided by the DRA component, the DDC component coordinates I/O accesses to manage the substantial degree of concurrency and to mitigate contention on exascale systems. The DDC component orchestrates I/O requests in both independent I/O and collective I/O operations, both of which have been observed to suffer serious contention at large scale [CSTS10, JiCS10]. The dynamic coordination of independent and collective I/O is discussed in the following subsections.…”
Section: Dynamically Coordinated I/O Architecture
confidence: 99%
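The excerpt distinguishes independent from collective I/O as the two access modes the DDC component coordinates. The minimal MPI-IO sketch below contrasts the two standard calls; the file names, buffer layout, and offsets are hypothetical, and it illustrates only the generic MPI-IO interface, not the cited system's implementation.

/* Minimal sketch contrasting independent and collective MPI-IO writes.
 * File names, buffer sizes, and offsets are hypothetical; error handling
 * is omitted for brevity. */
#include <mpi.h>

#define COUNT 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf[COUNT];
    for (int i = 0; i < COUNT; i++) buf[i] = rank;

    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
    MPI_File fh;

    /* Independent I/O: each process issues its request on its own schedule,
     * which can cause contention on shared storage at large scale. */
    MPI_File_open(MPI_COMM_WORLD, "independent.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* Collective I/O: all processes enter the call together, letting the
     * MPI-IO layer aggregate and reorder requests (e.g. two-phase I/O). */
    MPI_File_open(MPI_COMM_WORLD, "collective.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}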