2015
DOI: 10.1016/j.procs.2015.05.187

Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience

Cited by 25 publications (18 citation statements)
References 8 publications
“…Checkpointing and selection of checkpoints are important for ensuring end-user productivity in case of failures and recovery. 5,49,50 In the proposed approach, violation of the health rules is based on the assumption that the silent errors are almost instantaneous and affect the simulations immediately so that the rule violation is trigged within a few steps of the simulations. If such an event occurs, a healthy or error-free checkpoint is selected from a state that is considered healthy, based on the defined rules.…”
Section: Selection Of Checkpoint Considerations
Citation type: mentioning
confidence: 99%
“…Ali et al [27] use redundant communication to continuously update shadow data structures. Other user-level approaches target arrays [28] as well as the MapReduce [29] and hierarchical master/worker patterns [30]. Task scheduling in grids and clouds differs from our work in coarser-grained tasks and centralization (e.g.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…GVR has been used to demonstrate flexible multi-version rollback, forward error correction, and other creative recovery schemes [5,6]. Demonstrations include high-error rates, and results show modest runtime cost (< 1%) and programming effort in fullscale molecular dynamics, Monte Carlo, adaptive mesh, and indirect linear solver applications [7,8].…”
Section: Global View Resilience (GVR)
Citation type: mentioning
confidence: 99%
“…-Because latent ("silent") errors are complex to identify, the detector is computationally expensive. 7 The interval between two consecutive error detections bounds the error latency. Given the error location and timing, three steps are performed to correct the state of corrupted data.…”
Section: Algorithm-based Focused Recovery (ABFR)
Citation type: mentioning
confidence: 99%