Vilas Sridharan scite author profile

The Architectural Vulnerability Factor (AVF) of a hardware structure is the probability that a fault in the structure will affect the output of a program. AVF captures both microarchitectural and architectural fault masking effects; therefore, AVF measurements cannot generate insight into the vulnerability of software independent of hardware. To evaluate the behavior of software in the presence of hardware faults, we must isolate the software-dependent (architecture-level masking) portion of AVF from the hardware-dependent (microarchitecture-level masking) portion, providing a quantitative basis to make reliability decisions about software independent of hardware.In this work, we demonstrate that the new Program Vulnerability Factor (PVF) metric provides such a basis: PVF captures the architecture-level fault masking inherent in a program, allowing software designers to make quantitative statements about a program's tolerance to soft errors. PVF can also explain the AVF behavior of a program when executed on hardware; PVF captures the workload-driven changes in AVF for all structures. Finally, we demonstrate two practical uses for PVF: choosing algorithms and compiler optimizations to reduce a program's failure rate.

show abstract

Balancing Performance and Reliability in the Memory Hierarchy

Asadi

Sridharan

Tahoori

et al. 2005

View full text Add to dashboard Cite

show abstract

XED: Exposing On-Die Error Detection Information for Strong Memory Reliability

Nair¹,

Sridharan²,

Qureshi³

2016

View full text Add to dashboard Cite

Real-world design and evaluation of compiler-managed GPU redundant multithreading

Wadden

Lyashevsky

Gurumurthi

et al. 2014

SIGARCH Comput. Archit. News

View full text Add to dashboard Cite

Reliability for general purpose processing on the GPU (GPGPU) is becoming a weak link in the construction of reliable supercomputer systems. Because hardware protection is expensive to develop, requires dedicated on-chip resources, and is not portable across different architectures, the efficiency of software solutions such as redundant multithreading (RMT) must be explored. This paper presents a real-world design and evaluation of automatic software RMT on GPU hardware. We first describe a compiler pass that automatically converts GPGPU kernels into redundantly threaded versions. We then perform detailed power and performance evaluations of three RMT algorithms, each of which provides fault coverage to a set of structures in the GPU. Using real hardware, we show that compilermanaged software RMT has highly variable costs. We further analyze the individual costs of redundant work scheduling, redundant computation, and inter-thread communication, showing that no single component in general is responsible for high overheads across all applications; instead, certain workload properties tend to cause RMT to perform well or poorly. Finally, we demonstrate the benefit of architectural support for RMT with a specific example of fast, register-level thread communication

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Vilas Sridharan

A study of DRAM failures in the field

Eliminating microarchitectural dependency from Architectural Vulnerability

Balancing Performance and Reliability in the Memory Hierarchy

XED: Exposing On-Die Error Detection Information for Strong Memory Reliability

Real-world design and evaluation of compiler-managed GPU redundant multithreading

Contact Info

Product

Resources

About