2010
DOI: 10.2172/989384

Lightweight storage and overlay networks for fault tolerance.

Abstract: The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands to millions of processors. In such environments, it is critical to have fault-tolerance mechanisms, including checkpoint/restart, that scale with the size of applications and the percentage of the system on which the applications execute. For application-driven, periodic checkpoint operations, the state of the art does not provide a scalable solution. For example, on today's massive-sc…

Cited by 8 publications
(13 citation statements)
References 29 publications
“…They further make the case that it is likely that aggregate computational power for the foreseeable future will come from just such increases. Their exploration of current and alternative checkpointing techniques and associated pros and cons parallels those of others such as Oldfield [3]. All such work seems to draw the conclusion that current fault tolerance techniques, namely checkpoint and restart, in their current form coupled with the seemingly fixed relationship between number of CPU sockets and MTTI imply that run time efficiency will continue to fall as computational platforms become larger.…”
Section: Related Work (mentioning)
confidence: 90%
“…For example, we developed services for staging checkpoint data [23,24,31], HPC database integration [30], interactive visualization [25], network traffic analysis, and most recently CTH in transit analysis [22]. A recent paper describes these services in detail [17].…”
Section: Nessie (mentioning)
confidence: 99%
“…One approach to both reduce the burden on the file system and improve "effective" I/O rates for the application is to employ the use of "data services" [1,85,97,100,139]. Simply put, a data service is a separate (possibly parallel) application that performs operations on behalf of an actively running scientific application.…”
Section: Background and Motivation (mentioning)
confidence: 99%
“…On current capability-class HPC systems, services execute on compute nodes or service nodes and provide the application the ability to "offload" operations that present scalability challenges for the scientific code. One commonly used example for data services is data staging, or caching data between the application and the storage system [100,101,118]. Section 7.4 describes such a service.…”
Section: Background and Motivation (mentioning)
confidence: 99%