Lightweight storage and overlay networks for fault tolerance.

Oldfield, Ron A.

doi:10.2172/989384

Cited by 8 publications

(13 citation statements)

References 29 publications

“…They further make the case that it is likely that aggregate computational power for the foreseeable future will come from just such increases. Their exploration of current and alternative checkpointing techniques and associated pros and cons parallels those of others such as Oldfield [3]. All such work seems to draw the conclusion that current fault tolerance techniques, namely checkpoint and restart, in their current form coupled with the seemingly fixed relationship between number of CPU sockets and MTTI imply that run time efficiency will continue to fall as computational platforms become larger.…”

Section: Related Work (mentioning)

confidence: 90%

Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems

Brandt, Debusschere, Gentile, et al. 2008

2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)

The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the Mean Time To Interrupt vs. number of processor elements involved is quite similar over a large number of platforms. In this paper we present a system that uses hardware level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close to optimal set of processing elements given the available pool and the reliability requirements of the application.

“…For example, we developed services for staging checkpoint data [23,24,31], HPC database integration [30], interactive visualization [25], network traffic analysis, and most recently CTH in transit analysis [22]. A recent paper describes these services in detail [17].…”

Section: Nessie (mentioning)

confidence: 99%

Data co-processing for extreme scale analysis level II ASC milestone (4745).

Rogers, Moreland, Oldfield, et al. 2013

Self Cite

Exascale supercomputing will embody many revolutionary changes in the hardware and software of high-performance computing. A particularly pressing issue is gaining insight into the science behind the exascale computations. Power and I/O speed constraints will fundamentally change current visualization and analysis workflows. A traditional post-processing workflow involves storing simulation results to disk and later retrieving them for visualization and data analysis. However, at exascale, scientists and analysts will need a range of options for moving data to persistent storage, as the current offline or post-processing pipelines will not be able to capture the data necessary for data analysis of these extreme scale simulations. This Milestone explores two alternate workflows, characterized as in situ and in transit, and compares them. We find each to have its own merits and faults, and we provide information to help pick the best option for a particular use.

“…One approach to both reduce the burden on the file system and improve "effective" I/O rates for the application is to employ the use of "data services" [1,85,97,100,139]. Simply put, a data service is a separate (possibly parallel) application that performs operations on behalf of an actively running scientific application.…”

Section: Background and Motivation (mentioning)

confidence: 99%

“…On current capability-class HPC systems, services execute on compute nodes or service nodes and provide the application the ability to "offload" operations that present scalability challenges for the scientific code. One commonly used example for data services is data staging, or caching data between the application and the storage system [100,101,118]. Section 7.4 describes such a service.…”

Section: Background and Motivation (mentioning)

confidence: 99%

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale.

Curry, Ferreira, Pedretti, et al. 2012

Self Cite

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware.

Acknowledgments: The authors would like to thank the Red Storm, Cielo, and Cielo Del Sur operations teams for their support during the dedicated experiments. These systems are valuable resources and the teams' prompt response ensured that we maximized our time on them.
