2012
DOI: 10.1007/978-3-642-29740-3_36

Impact of Over-Decomposition on Coordinated Checkpoint/Rollback Protocol

Abstract: Failure-free execution will become rare on future exascale computers. Thus, fault tolerance is now an active field of research. In this paper, we study the impact of decomposing an application into much more parallelism than the physical parallelism on the rollback step of fault-tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance the workload after a failure, without the need for spare nodes, while preserving performance. We show that the overhead on normal execu…
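The mechanism the abstract describes, over-decomposition, splits the application into many more tasks than there are physical nodes, so that after a failure the runtime can redistribute the lost node's tasks across the survivors instead of requiring a spare. Below is a minimal sketch of that redistribution idea, not the paper's actual protocol: the task count, node names, and round-robin assign() helper are all hypothetical.

    # Minimal sketch of over-decomposition for post-failure load balancing.
    # Hypothetical example, not the paper's protocol: task counts, node names,
    # and the round-robin assignment are illustrative assumptions.

    def assign(tasks, nodes):
        """Round-robin the task list over the currently alive nodes."""
        placement = {n: [] for n in nodes}
        for i, t in enumerate(tasks):
            placement[nodes[i % len(nodes)]].append(t)
        return placement

    nodes = [f"node{i}" for i in range(4)]   # 4 physical nodes
    tasks = list(range(32))                  # 32 tasks: 8x over-decomposition

    before = assign(tasks, nodes)            # normal execution: 8 tasks per node

    # One node fails; after rollback to the coordinated checkpoint, the same
    # 32 tasks are spread over the 3 survivors -- no spare node is needed.
    survivors = [n for n in nodes if n != "node2"]
    after = assign(tasks, survivors)

    print({n: len(ts) for n, ts in before.items()})  # 8 tasks on each of the 4 nodes
    print({n: len(ts) for n, ts in after.items()})   # 10-11 tasks on each survivor

With exactly one task per node (no over-decomposition), the failed node's work could not be spread out this way; a spare node would be required to restore the original balance.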

Cited by 5 publications (3 citation statements) | References 19 publications

Citation statements (ordered by relevance):
“…As systems are approaching billion‐way concurrency at exascale, we argue that data‐driven programming models will likely employ over‐decomposition to generate more fine‐grained tasks than available parallelism. While over‐decomposition has the ability to improve utilization and fault tolerance at extreme scales, it poses severe challenges on the scheduling system to make fast scheduling decisions (e.g., millions/s) and to remain available, in order to achieve the best performance. These requirements are far beyond the capability of today's centralized batch scheduling systems.…”
Section: Introduction (mentioning)
Confidence: 99%
“…As systems are growing exponentially in parallelism, approaching billion-way concurrency at exascale [2], we argue that future programming models will likely employ over-decomposition, generating many more fine-grained tasks than available parallelism. While over-decomposition has been shown to improve utilization at extreme scales as well as to make fault tolerance more efficient [3] [4], it poses significant challenges on the task scheduling system to make extremely fast scheduling decisions (e.g., millions/sec), in order to achieve the highest throughput and utilization.…”
Section: Introduction (mentioning)
Confidence: 99%
“…This time could be shortened if one could store and resubmit the task graph from one timestep to another, such as in [2].…”
(mentioning)
Confidence: 99%
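The shortcut this last statement points to, storing the task graph and resubmitting it at each timestep rather than rebuilding it, can be sketched as follows. This is a hedged illustration only; build_task_graph() and submit() are hypothetical placeholders, not an API from [2] or from the cited paper, and it assumes the graph structure does not change between timesteps.

    # Hypothetical sketch: reuse a stored task graph across timesteps instead of
    # reconstructing it each iteration. All names here are illustrative.

    def build_task_graph(num_tasks=1000):
        # Stand-in for an expensive graph-construction phase.
        return [("task", i) for i in range(num_tasks)]

    def submit(graph, step):
        # Stand-in for handing the graph to the runtime scheduler.
        return len(graph)

    cached_graph = None
    for step in range(10):
        if cached_graph is None:
            cached_graph = build_task_graph()   # built once, on the first timestep
        submit(cached_graph, step)              # resubmitted unchanged thereafter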