Graphics processing units (GPUs) are increasingly common in both safety-critical and high-performance computing (HPC) applications. Some current supercomputers comprise thousands of GPUs, so the probability of device corruption becomes very high. Moreover, the GPU's parallel capabilities are very attractive to the automotive and aerospace markets, where reliability is a serious concern. In this paper, the neutron sensitivity of modern GPU caches and internal resources is experimentally evaluated. Various Duplication With Comparison strategies to reduce GPU radiation sensitivity are then presented and validated through radiation experiments. Threads must be duplicated carefully to avoid undesired errors in shared resources and to avoid exacerbating errors in critical resources such as the scheduler.
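To make the duplication-with-comparison strategy concrete, here is a minimal CUDA sketch, not the paper's implementation: the kernel is launched with twice the logical thread count so that even/odd thread pairs compute the same element into separate buffers, and a comparison pass flags any divergence. The saxpy-style workload and all names are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Thread-level duplication with comparison (DWC), illustrative sketch.
// Launch with 2*n threads: even/odd pairs compute the same logical element
// into separate output buffers, so a transient fault in one copy cannot
// silently corrupt the other.
__global__ void saxpy_dup(int n, float a, const float *x, const float *y,
                          float *out0, float *out1) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int i = t >> 1;              // logical element index
    if (i >= n) return;
    float r = a * x[i] + y[i];   // the protected computation
    if (t & 1) out1[i] = r;      // redundant copy
    else       out0[i] = r;      // primary copy
}

// Comparison pass (launch with n threads): any divergence between the two
// copies raises a global mismatch flag instead of committing bad data.
__global__ void dwc_compare(int n, const float *out0, const float *out1,
                            int *mismatch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && out0[i] != out1[i]) atomicExch(mismatch, 1);
}
```

Writing the two copies to distinct buffers, rather than reusing one, is one way to honor the abstract's warning about shared resources: the duplicated threads never touch each other's state, so the comparison remains meaningful.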
Most High Performance Computing (HPC) systems today are known as "power hungry" because they aim at computing speed regardless of energy consumption. Some scientific applications still demand more speed, and the community expects to reach exascale by the end of the decade. Nevertheless, to reach exascale we need to search for alternatives that cope with energy constraints. A promising step forward in this direction is the use of low-power processors such as ARM. ARM processors target low power consumption, in contrast with the Xeon processors conventional in HPC, which aim at computing speed. This paper presents a comparison between ARM and Xeon to evaluate whether ARM is the future building block of HPC. We choose time-to-solution, peak power, and energy-to-solution to evaluate both processors from the user's perspective. The results indicate that although ARM has lower peak power, Xeon still has a better tradeoff from the user's point of view.
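The three metrics are simple to compute from a sampled power trace; the host-side sketch below (illustrative data, assumed one-second sampling period, not measurements from the paper) shows how a low-peak-power node can still lose on energy-to-solution when its time-to-solution is much longer.

```cuda
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Host-side sketch of the three user-perspective metrics. Power samples
// (watts) are assumed to arrive at a fixed period; all numbers are
// illustrative.
struct RunMetrics {
    double time_to_solution;   // wall-clock seconds (samples * period)
    double peak_power;         // highest sampled draw, watts
    double energy_to_solution; // joules: average power * time-to-solution
};

RunMetrics evaluate(const std::vector<double> &watts, double period_s) {
    RunMetrics m{};
    m.time_to_solution = watts.size() * period_s;
    m.peak_power = *std::max_element(watts.begin(), watts.end());
    double avg = std::accumulate(watts.begin(), watts.end(), 0.0) / watts.size();
    m.energy_to_solution = avg * m.time_to_solution;
    return m;
}

int main() {
    std::vector<double> arm(40, 15.0);            // slow node, low draw
    std::vector<double> xeon{95, 110, 120, 118};  // fast node, high draw
    RunMetrics a = evaluate(arm, 1.0), x = evaluate(xeon, 1.0);
    std::printf("ARM : %4.0f s, peak %3.0f W, %4.0f J\n",
                a.time_to_solution, a.peak_power, a.energy_to_solution);
    std::printf("Xeon: %4.0f s, peak %3.0f W, %4.0f J\n",
                x.time_to_solution, x.peak_power, x.energy_to_solution);
}
```

In this toy trace the ARM node draws an eighth of the Xeon node's peak power but runs ten times longer, so its energy-to-solution is worse (600 J vs. 443 J), mirroring the kind of tradeoff the abstract reports.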
We present an in-depth analysis of transient fault effects on HPC applications in Intel Xeon Phi processors, based on radiation experiments and high-level fault injection. Besides measuring the realistic error rates of Xeon Phi, we quantify Silent Data Corruptions (SDCs) by correlating the distribution of corrupted elements in the output with the application's characteristics. We evaluate the benefits of imprecise computing for reducing the programs' error rate. For example, for HotSpot a 0.5% tolerance in the output value reduces the error rate by 85%. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded with different criticalities. For example, faults occurring in the middle of LUD execution, or in the Sort and Tree portions of CLAMR, are more critical than the remaining portions. Mitigation techniques can then be relaxed or hardened based on the criticality of the particular portions.
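A tolerance-based checker of the kind implied by the HotSpot example can be sketched as follows (host-side, illustrative, not the paper's checker): an output element counts as an SDC only if its relative deviation from the fault-free golden copy exceeds the accepted tolerance, so small imprecisions are masked while large corruptions are still caught.

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Host-side sketch of tolerance-based SDC counting; the checker, the 1e-6
// guard against division by zero, and the data are illustrative assumptions.
int count_sdc(const std::vector<float> &golden, const std::vector<float> &out,
              float rel_tol) {
    int sdc = 0;
    for (size_t i = 0; i < golden.size(); ++i) {
        float denom = std::max(std::fabs(golden[i]), 1e-6f);
        if (std::fabs(out[i] - golden[i]) / denom > rel_tol) ++sdc;
    }
    return sdc;
}

int main() {
    std::vector<float> golden{1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> faulty{1.001f, 2.0f, 3.9f, 4.0f}; // 0.1% and 30% off
    std::printf("strict        : %d SDCs\n", count_sdc(golden, faulty, 0.0f));
    std::printf("0.5%% tolerance: %d SDCs\n", count_sdc(golden, faulty, 0.005f));
}
```

Here the strict check reports two SDCs while the 0.5% tolerance masks the small deviation and reports one, which is exactly how an accepted imprecision lowers the observed error rate.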
In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as far as imprecise computing is concerned, simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications' output, correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while the Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.
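The two metrics the abstract names, corrupted-element spatial locality and dataset-wise mean relative error, can be combined in one pass over a matrix output. The host-side sketch below is illustrative (the pattern labels and the 1e-6 guard are assumptions): it classifies the corruption as single, row, column, or scattered and averages the relative error over the corrupted elements.

```cuda
#include <cmath>
#include <cstdio>
#include <set>
#include <vector>

// Host-side sketch: spatial-locality label plus dataset-wise mean relative
// error for an n x n output. Labels and thresholds are assumptions.
void analyze(const std::vector<float> &gold, const std::vector<float> &out,
             int n) {
    std::set<int> rows, cols;
    double rel_sum = 0.0;
    int corrupted = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            int i = r * n + c;
            if (out[i] == gold[i]) continue;
            ++corrupted;
            rows.insert(r);
            cols.insert(c);
            double denom = std::fabs(gold[i]) > 1e-6 ? std::fabs(gold[i]) : 1e-6;
            rel_sum += std::fabs(out[i] - gold[i]) / denom;
        }
    if (corrupted == 0) { std::printf("no corruption\n"); return; }
    const char *pattern = corrupted == 1   ? "single"
                        : rows.size() == 1 ? "row"
                        : cols.size() == 1 ? "column"
                                           : "scattered";
    std::printf("%d corrupted (%s), mean relative error %.3f\n",
                corrupted, pattern, rel_sum / corrupted);
}

int main() {
    int n = 3;
    std::vector<float> gold(n * n, 1.0f), out = gold;
    out[3] = 2.0f; out[4] = 0.5f; out[5] = 1.1f;  // corrupt row 1
    analyze(gold, out, n);  // -> "3 corrupted (row), mean relative error 0.533"
}
```

Intuitively, corruption confined to one row or column hints at a fault in a shared structure, while scattered corruption suggests independent errors; that distinction is what makes spatial locality informative beyond a raw mismatch count.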