Leonardo Bautista-Gomez scite author profile

Leonardo Bautista-Gomez

5Publications

229Citation Statements Received

44Citation Statements Given

How they've been cited

257

227

How they cite others

Affiliations

Barcelona Supercomputing Center, Argonne National Laboratory, Tokyo Institute of Technology

Publications

Order By: Most citations

Detecting silent data corruption through data dynamic monitoring for scientific applications

Bautista-Gomez

Cappello

2014

SIGPLAN Not.

View full text Add to dashboard Cite

Parallel programming has become one of the best ways to express scientific models that simulate a wide range of natural phenomena. These complex parallel codes are deployed and executed on large-scale parallel computers, making them important tools for scientific discovery. As supercomputers get faster and larger, the increasing number of components is leading to higher failure rates. In particular, the miniaturization of electronic components is expected to lead to a dramatic rise in soft errors and data corruption. Moreover, soft errors can corrupt data silently and generate large inaccuracies or wrong results at the end of the computation. In this paper we propose a novel technique to detect silent data corruption based on data monitoring. Using this technique, an application can learn the normal dynamics of its datasets, allowing it to quickly spot anomalies. We evaluate our technique with synthetic benchmarks and we show that our technique can detect up to 50% of injected errors while incurring only negligible overhead.

show abstract

Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications

Bouguerra

Bautista-Gomez

et al. 2014

View full text Add to dashboard Cite

Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer

Bautista-Gomez

Zyulkyarov

Unsal

et al. 2016

View full text Add to dashboard Cite

Abstract-Supercomputers offer new opportunities for scientific computing as they grow in size. However, their growth also poses new challenges. Resilience has been recognized as one of the most pressing issues to solve for extreme scale computing. Transistor scaling in the single-digit nanometer era and power constraints might dramatically increase the failure rate of next generation machines. DRAM errors have been analyzed in the past for different supercomputers but those studies are usually based on job scheduler logs and counters produced by hardwarelevel error correcting codes. Consequently, little is known about errors escaping hardware checks, which lead to silent data corruption. This work attempts to fill that gap by analyzing memory errors for over a year on a cluster with about 1000 nodes featuring low-power memory without error correction. The study gathered millions of events recording detailed information of thousands of memory errors, many of them corrupting multiple bits. Several factors are analyzed, such as temporal and spatial correlation between errors, but also the influence of temperature and even the position of the sun in the sky. The study showed that most multi-bit errors corrupted non-adjacent bits in the memory word and that most errors flipped memory bits from 1 to 0. In addition, we observed thousands of cases of multiple single-bit errors occurring simultaneously in different regions of the memory. These new observations would not be possible by simply analyzing error correction counters on classical systems. We propose several directions in which the findings of this study can help the design of more reliable systems in the future.

show abstract

Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications

Berrocal

Bautista-Gomez

et al. 2015

View full text Add to dashboard Cite

Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. Consequently, the number of soft errors is expected to increase dramatically in the coming years. In this respect, techniques that leverage certain properties of iterative HPC applications (such as the smoothness of the evolution of a particular dataset) can be used to detect silent errors at the application level. In this paper, we present a pointwise detection model with two phases: one involving the prediction of the next expected value in the time series for each data point, and another determining a range (i.e., normal value interval) surrounding the predicted next-step value. We show that dataset correlation can be used to detect corruptions indirectly and limit the size of the data set to monitor, taking advantage of the underlying physics of the simulation. Our results show that, using our techniques, we can detect a large number of corruptions (i.e., above 90% in some cases) with 84% memory overhead, and 13.75% extra computation time.

show abstract

Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing

Bouguerra

Gainaru

Bautista-Gomez

et al. 2013

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.