Fault-Tolerance Techniques for High-Performance Computing

doi:10.1007/978-3-319-20943-2

Cited by 70 publications

(1 citation statement)

References 55 publications


“…Checkpoint-rollback recovery is a straightforward and popular black-box solution to recover from faults in simulation software [42,43,44]. During run time, the software regularly creates snapshots of the simulation data.…”

Section: Checkpointing and Resilience (mentioning)

confidence: 99%
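The checkpoint-rollback pattern described in the citation statement above can be illustrated with a minimal sketch. This is not code from waLBerla or from the cited book. The function names, the pickle-based snapshot file, the toy "simulation" (a counter), and the `fail_at` fault-injection parameter are all assumptions made for illustration only.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, step, state):
    """Write a snapshot atomically: dump to a temp file, then rename over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)  # a crash mid-write never corrupts the old snapshot


def load_checkpoint(path):
    """Return (step, state) from the last snapshot, or None if none exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        snap = pickle.load(f)
    return snap["step"], snap["state"]


def run_simulation(path, total_steps, interval, fail_at=None):
    """Advance a toy simulation, snapshotting every `interval` steps.

    `fail_at` is a hypothetical knob that injects a fault at a given step.
    On (re)start, the run rolls back to the most recent snapshot, if any.
    """
    restored = load_checkpoint(path)
    step, state = restored if restored is not None else (0, 0.0)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated fault")
        state += 1.0  # stand-in for one time step of real work
        step += 1
        if step % interval == 0:
            save_checkpoint(path, step, state)
    return state
```

In this sketch, a run that fails between snapshots loses only the work done since the last one; calling `run_simulation` again with the same `path` resumes from that snapshot rather than from step 0. The snapshot interval trades checkpointing overhead against the amount of recomputation after a fault.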

waLBerla: A block-structured high-performance framework for multiphysics simulations

Bauer

Eibl

Godenschwager

et al. 2021

Computers & Mathematics with Applications


Programming current supercomputers efficiently is a challenging task. Multiple levels of parallelism on the core, on the compute node, and between nodes need to be exploited to make full use of the system. Heterogeneous hardware architectures with accelerators further complicate the development process. waLBerla addresses these challenges by providing the user with highly efficient building blocks for developing simulations on block-structured grids. The block-structured domain partitioning is flexible enough to handle complex geometries, while the structured grid within each block allows for highly efficient implementations of stencil-based algorithms. We present several example applications realized with waLBerla, ranging from lattice Boltzmann methods to rigid particle simulations. Most importantly, these methods can be coupled together, enabling multiphysics simulations. The framework uses meta-programming techniques to generate highly efficient code for CPUs and GPUs from a symbolic method formulation. To ensure software quality and performance portability, a continuous integration toolchain automatically runs an extensive test suite encompassing multiple compilers, hardware architectures, and software configurations.


Failure analysis and prediction for the CIPRES science gateway

Singh

Smallen

Tilak

et al. 2015

Concurrency and Computation


Science gateways promote collaboration among researchers by providing them with access to community-developed tools and data collections. The Cyberinfrastructure for Phylogenetic Research (CIPRES) science gateway is one of the most popular gateways, with approximately 3000 active users since 2012, and the user base is growing each year. While increasing the number of compute resources available to CIPRES would address their growth needs, it also introduces additional complexity as the likelihood of failure increases. In this paper, we analyze historical job data from CIPRES and combine it with historical software and services monitoring data to create a machine learning model to predict where a user's job will complete successfully on resources. At one operating point of our classifier, we are able to detect 50% of jobs that will fail with a false detection rate of less than 5%. In 2014, accurately predicting 50% of CIPRES job failures and redirecting them to other resources would have saved 900K compute core hours, furthering phylogenetic research. These statistical models will also be used as a base to build a more generic automated monitoring analysis service for science gateways.

From the XDCDB, we use the wall-clock-time duration of a job's execution (wall duration), the size of the job in number of nodes used (node count), and processors used (processors); each field is encoded by a single numeric feature. We also use the type of queue to which new jobs are sent


Improving batch schedulers with node stealing for failed jobs

Du

Marchal

Pallez

et al. 2024

Concurrency and Computation


Summary: After a machine failure, batch schedulers typically re-schedule the job that failed with a high priority. This is fair for the failed job but still requires that job to re-enter the submission queue and to wait for enough resources to become available. The waiting time can be very long when the job is large and the platform highly loaded, as is the case with typical HPC platforms. We propose another strategy: when a job fails, if no platform node is available, we steal one node from another job and use it to continue the execution of the failed job despite the failure. In this work, we give a detailed assessment of this node stealing strategy using traces from the Mira supercomputer at Argonne National Laboratory. The main conclusion is that node stealing improves the utilization of the platform and dramatically reduces the flow of large jobs, at the price of slightly increasing the flow of small jobs.

