Soft-error resilience of the IBM POWER6 processor

Sanda, P. N.; Kellington, Jeffrey W.; Kudva, P.; Kalla, R.; McBeth, R. B.; Ackaret, J.; Lockwood, R.; Schumann, John; Jones, Carolyn

doi:10.1147/rd.523.0275

Cited by 97 publications

(38 citation statements)

References 16 publications

“…The design overhead evaluation of the proposed design is presented in Sect. 8. We conclude our work in Sect.…”

Section: Introduction (mentioning)

confidence: 80%

“…In order to handle these inevitable errors, we must integrate in our design fault-tolerant features so that processors can continue to correctly perform their specified tasks despite the occurrence of logic errors [5]. Such designs as the Intel Itanium [6,7], the IBM Power6 [8], the z10 [9], the Fujitsu SPARC64 [10], etc., already include transient fault detection and recovery mechanisms.…”

Section: Introduction (mentioning)

confidence: 99%

Tolerating Radiation-Induced Transient Faults in Modern Processors

Gaudiot

2009

Int J Parallel Prog

As MOS device sizes continue shrinking, lower charges, for example those charges carried by single ionizing particles of naturally occurring radiation, are sufficient to upset the functioning of complex modern microprocessors. In order to handle these inevitable errors, designs should include fault-tolerant features so that the processors can continue to correctly perform despite the occurrence of errors. The main goal of this work is to develop architecture mechanisms to protect processors against the effect of such radiation-induced transient faults. It should first be noted that, from a program execution perspective, many faults manifest themselves as control flow errors that cause processors to violate the correct sequencing of instructions. We first present a basic compile-time signature assignment algorithm and describe a novel approach to improve the fault detection coverage of the basic algorithm. Moreover, to allow the processor to efficiently check the run-time sequence and detect control flow errors, we introduce an on-chip assigned-signature checker which is capable of executing three additional instructions (SIC, SIJ, SIJC). Second, since the very concept of simultaneous multi-threading (SMT) provides the necessary redundancy, some proposals have been made to run two copies of the same thread on top of SMT platforms in order to detect and correct soft errors. This allows, upon detection of an error, the rolling back of the processor state to a known safe point, and then a retry of the instructions, thereby effecting a completely error-free execution. This paper has focused on two crucial implementation issues introduced by this

“…For instance, in an in-order RISC core the execution and memory stages are highly vulnerable to dynamic variations, and the memory class has a higher vulnerability in comparison to the logical/arithmetic class [8]. We note that complex high-performance cores such as IBM POWER6 also confirm that vulnerability is not uniform across the instructions set [19]. We extend the notion of ILV to a more coarse-grained task-level metric, TLV.…”

Section: Task-level Vulnerability (TLV) and OpenMP Tasks (mentioning)

confidence: 87%

Variation-tolerant OpenMP Tasking on Tightly-coupled Processor Clusters

Rahimi, Marongiu, Burgio, et al.

2013

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013

We present a variation-tolerant tasking technique for tightly-coupled shared memory processor clusters that relies upon modeling advance across the hardware/software interface. This is implemented as an extension to the OpenMP 3.0 tasking programming model. Using the notion of Task-Level Vulnerability (TLV) proposed here, we capture dynamic variations caused by circuit-level variability as high-level software knowledge. This is accomplished through a variation-aware hardware/software codesign where: (i) Hardware features variability monitors in conjunction with online per-core characterization of TLV metadata; (ii) Software supports a Task-level Errant Instruction Management (TEIM) technique to utilize TLV metadata in the runtime OpenMP task scheduler. This method greatly reduces the number of recovery cycles compared to the baseline scheduler of OpenMP [22]; consequently, instructions per cycle (IPC) of a 16-core processor cluster is increased up to 1.51× (1.17× on average). We evaluate the effectiveness of our approach with various numbers of cores (4, 8, 12, 16), and across a wide temperature range (∆T = 90°C).

“…These statistics can be derived from a properly validated, cycle-accurate architectural simulator as explained in [65]. As has been observed in prior work on POWER machines [69], the AD factor tends to dominate over MD by a large factor, especially if the focus is only on SDC. Therefore, for simplicity of analysis, in this chapter we only focus on AD, while effectively assuming that MD is invariant across the class of applications considered for a given (fixed) machine implementation.…”

Section: Resilience Modeling (mentioning)

confidence: 96%

“…For soft error rate (SER) modeling, we estimate the failure rate (measured in standard units of failures in time or FITs) using an approach adopted from industrial practice [44,69]. In such an evaluation methodology, machine-level derating (MD) and application-level derating (AD) are treated as decoupled factors.…”

Section: Resilience Modeling (mentioning)

confidence: 99%

First-Order Modeling Frameworks for Power-Efficient and Reliable Multiprocessor Systems

Wang

As the semiconductor industry keeps evolving at the pace predicted by Moore's Law, computer system architects are facing increasing challenges from the three major design constraints: performance, power, and reliability. Thermal constraints from a reasonable cooling cost do not scale well as technology evolves. The dwindling scaling on threshold voltage leads to a slower pace of supply voltage scaling. These two effects lead to an increasing power density in current and future technology generations. Reliability has emerged as a primary design constraint due to the smaller feature size and generally lower supply voltage for electronic devices. Transient errors caused by high-energy particle strikes and voltage noise are expected to increase significantly in frequency. Performance improvement becomes more challenging for future architectures with limitations set by the power and the resilience constraints. Integration of accelerators to create heterogeneous processors is becoming more common for both power and performance reasons. However, this adds one more dimension to the design space that is already complex due to technology variants, system organizations, application variability, and so on. Therefore, high-level models are essential for system designers to explore the design space and make decisions in a timely manner. Additionally, the three design constraints compete with each other. For example, resilience-aware techniques, such as DMR and TMR, are expensive in terms of power and performance, and low-power designs usually come at the price of lower speed. Consequently, system designers must make trade-offs by considering all three design constraints at the same time. To address these challenges, we (1) propose an analytical modeling framework called Lumos that is capable of modeling power and performance for heterogeneous architectures with hardware accelerators. Then we (2) use Lumos to explore the design space composed of CPU cores and accelerators, revealing important scaling trends for future heterogeneous architectures. We further (3) propose a rapid modeling framework to characterize resilience across a range of applications in the DSP and image processing domains. Finally, we (4) propose an integrated framework to optimize energy efficiency by trading off the design constraints of power, performance, and resilience.
