Fault Modeling of Extreme Scale Applications Using Machine Learning

Vishnu, Abhinav; Dam, Hubertus J. J. van; Tallent, Nathan R.; Kerbyson, Darren J.; Hoisie, Adolfy

doi:10.1109/ipdps.2016.111

Cited by 19 publications

(15 citation statements)

References 28 publications

“…Some works use collaborative filtering to colocate tasks in clouds by estimating application interference [30]. Others are closer to the application level and use binary classification to distinguish benign memory faults from application errors in order to execute recovery algorithms (see [31] for instance).…”

Section: Data-aware Resource Management (mentioning)

confidence: 99%

Tuning EASY-Backfilling Queues

Lelong

Reis

Trystram

2018

Job Scheduling Strategies for Parallel Processing

Abstract. EASY-Backfilling is a popular scheduling heuristic for allocating jobs in large scale High Performance Computing platforms. While its aggressive reservation mechanism is fast and prevents job starvation, it does not try to optimize any scheduling objective per se. We consider in this work the problem of tuning EASY using queue reordering policies. More precisely, we propose to tune the reordering using a simulation-based methodology. For a given system, we choose the policy in order to minimize the average waiting time. This methodology departs from the First-Come, First-Serve rule and introduces a risk on the maximum values of the waiting time, which we control using a queue thresholding mechanism. This new approach is evaluated through a comprehensive experimental campaign on five production logs. In particular, we show that the behavior of the systems under study is stable enough to learn a heuristic that generalizes in a train/test fashion. Indeed, the average waiting time can be reduced consistently (between 11% and 42% for the logs used) compared to EASY, with almost no increase in maximum waiting times. This work departs from previous learning-based approaches and shows that scheduling heuristics for HPC can be learned directly in a policy space.
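The queue reordering and thresholding idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Job` fields, the `reorder_queue` function, the shortest-requested-time policy, and the rule that jobs waiting longer than the threshold are served first in submission order are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # All fields are hypothetical; real workload logs carry more attributes.
    job_id: str
    submit_time: float     # seconds since the start of the log
    requested_time: float  # user-requested walltime in seconds


def shortest_requested_first(job):
    """Example reordering policy (assumed): smallest requested walltime first."""
    return (job.requested_time, job.submit_time)


def reorder_queue(queue, now, threshold, policy_key):
    """Order the waiting queue before the backfilling pass.

    Jobs that have waited longer than `threshold` are placed first in
    submission (FCFS) order, which bounds the maximum waiting time.
    The remaining jobs are sorted by the chosen policy.
    """
    overdue = [j for j in queue if now - j.submit_time > threshold]
    regular = [j for j in queue if now - j.submit_time <= threshold]
    overdue.sort(key=lambda j: j.submit_time)
    regular.sort(key=policy_key)
    return overdue + regular


queue = [
    Job("a", 0.0, 3600.0),
    Job("b", 100.0, 7200.0),
    Job("c", 200.0, 600.0),
    Job("d", 300.0, 1200.0),
]

# With a threshold of 950 s at time 1000 s, only job "a" is overdue.
with_threshold = [j.job_id for j in reorder_queue(
    queue, now=1000.0, threshold=950.0, policy_key=shortest_requested_first)]

# With an infinite threshold the policy alone decides the order.
policy_only = [j.job_id for j in reorder_queue(
    queue, now=1000.0, threshold=float("inf"), policy_key=shortest_requested_first)]

print(with_threshold)
print(policy_only)
```

In this sketch the threshold is the only safeguard against starvation of long jobs; choosing the policy and the threshold value from simulations on historical logs is the tuning step the abstract describes.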

“…ML-based methods have been used in several ways. Vishnu et al. [18] proposed a combination of eight ML algorithms to assess the impact of multi-bit memory errors on HPC applications and to compare their predictions with the fault-injection results. In [19], the authors used linear-regression techniques to backtrack the propagation of soft errors through processes dependent on many-core processor systems.…”

Section: Related Work (mentioning)

confidence: 99%

Empirical Mathematical Model of Microprocessor Sensitivity and Early Prediction to Proton and Neutron Radiation-Induced Soft Errors

Serrano-Cases

Reyneri

Morilla

et al. 2020

IEEE Trans. Nucl. Sci.

A mathematical model is described to predict microprocessor fault tolerance under radiation. The model is empirically trained by combining data from simulated fault-injection campaigns and radiation experiments, both with protons (at the CNA facilities, Seville, Spain) and with neutrons (at the LANSCE Weapons Neutron Research facility at Los Alamos, USA). The sensitivity to soft errors of different blocks of commercial processors is identified to estimate the reliability of a set of programs that had previously been optimized, hardened, or both. The results showed a standard error under 0.1 in the case of the ARM processor and 0.12 in the case of the MSP430 microcontroller.

“…While this approach focuses on reducing the total time to complete the fault injection campaigns, our approach aims at correlating large subsets of application profiles and architecture characteristics with fault injection results in order to pinpoint the most relevant parameters/traces on the target system. Vishnu et al. [12] evaluate the impact of multi-bit memory errors, both permanent and transient, on HPC applications. This work considers eight different ML algorithms (e.g., support vector machines, k-nearest neighbors, three distinct decision trees), comparing their predictions (i.e., the error probability) with the ground truth (i.e., fault injection results).…”

Section: Review Of Fault Injection Approaches Using Virtual Platf (mentioning)

confidence: 99%

Using Machine Learning Techniques to Evaluate Multicore Soft Error Reliability

Rosa

Garibotti

Ost

et al. 2019

IEEE Trans. Circuits Syst. I

Virtual platform frameworks have been extended to allow earlier soft error analysis of more realistic multicore systems (i.e., real software stacks, state-of-the-art ISAs). The high observability and simulation performance of the underlying frameworks make it possible to generate and collect more error/failure-related data, considering complex software stack configurations, in a reasonable time. When dealing with sizeable failure-related data sets obtained from multiple fault campaigns, it is essential to filter out parameters (i.e., features) without a direct relationship with the system soft error analysis. In this regard, this paper proposes the use of supervised and unsupervised machine learning techniques, aiming to eliminate non-relevant information as well as identify the correlation between fault injection results and application and platform characteristics. This novel approach provides engineers with appropriate means to investigate new and more efficient fault mitigation techniques. The underlying approach is validated with an extensive data set gathered from more than 1.2 million fault injections, comprising several benchmarks, a Linux OS and parallelization libraries (e.g., MPI, OpenMP), as well as through a realistic automotive case study.
