2017
DOI: 10.1109/mm.2017.40
IBM Power9 Processor Architecture

Cited by 86 publications (35 citation statements). References: 0 publications.
“…GPUs are designed to improve system throughput, and their design is aimed at exploiting Thread Level Parallelism (TLP) by supporting the concurrent execution of a vast number of threads. The number of threads that a GPU can simultaneously execute exceeds, by several orders of magnitude, the number of hardware contexts supported by advanced processors like the IBM Power9 [32] or the Intel Knights Landing [33]. This feature is especially important for the execution of parallel scientific applications that rely on a high number of threads.…”
Section: Introduction (mentioning)
confidence: 99%
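The "several orders of magnitude" gap in the quoted passage can be made concrete with back-of-the-envelope arithmetic. The Power9 figure below is the 24-core SMT4 configuration; the GPU figures (80 SMs, 2,048 resident threads per SM) are illustrative assumptions for a typical datacenter GPU, not numbers from the cited papers.

```python
import math

# CPU side: IBM Power9, 24-core SMT4 variant.
power9_threads = 24 * 4             # 96 hardware thread contexts

# GPU side: assumed example device.
gpu_sms = 80                        # assumed streaming multiprocessor count
gpu_threads_per_sm = 2048           # assumed max resident threads per SM
gpu_threads = gpu_sms * gpu_threads_per_sm

# Express the gap as a power of ten.
gap = math.log10(gpu_threads / power9_threads)
print(f"GPU resident threads: {gpu_threads}, "
      f"Power9 hardware contexts: {power9_threads}, "
      f"ratio ~ 10^{gap:.1f}")
```

Under these assumptions the ratio is roughly three orders of magnitude, consistent with the claim in the citing paper.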
“…During runtime, LT will separately fetch the mask bits along with the instructions in the I-cache. 1 Runtime Operations: With this architectural support, we now briefly describe the overall operation of the baseline system in DLA mode (Fig. 2).…”
Section: Overview of DLA Baseline (mentioning)
confidence: 99%
“…Similarly, our baseline DLA can be easily implemented on an SMT substrate. Indeed, if we already have the ability to fuse two cores into a wider one that supports SMT [1], we can build a straightforward DLA-on-SMT system, which simply fetches instructions from both threads in a round-robin fashion and gives no priority to either thread. This system is in general faster than running DLA on two cores, largely because sharing resources improves utilization.…”
Section: Adaptive Resource Allocation (mentioning)
confidence: 99%
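The fetch policy described in the quoted passage — pulling instructions from both threads in round-robin order with no priority — can be sketched in a few lines. This is a hypothetical simplification of the DLA-on-SMT front end, not the authors' implementation; the function name and stream representation are illustrative.

```python
from collections import deque

def round_robin_fetch(thread_a, thread_b):
    """Interleave instructions from two thread streams, one per
    thread per round, giving no priority to either thread.
    When one stream runs dry, drain the other."""
    a, b = deque(thread_a), deque(thread_b)
    fetched = []
    while a or b:
        if a:
            fetched.append(a.popleft())
        if b:
            fetched.append(b.popleft())
    return fetched

print(round_robin_fetch(["A0", "A1", "A2"], ["B0", "B1"]))
# → ['A0', 'B0', 'A1', 'B1', 'A2']
```

The interleaving shows why sharing one wide front end can beat two narrow cores: whenever one thread stalls or runs out of work, the other immediately absorbs the idle fetch slots.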
“…hierarchy, adding a 512KB private second-level SRAM cache to all configurations. We consider two baseline designs: (1) an Intel-like design featuring a 32MB SRAM-based NUCA LLC, referred to as 3level-SRAM, and (2) a 128MB eDRAM-based NUCA LLC similar to the POWER9 [40], referred to as 3level-eDRAM. Using CACTI, we find the bank access latency in the SRAM design to be 7 cycles, and optimistically assume the same access latency for the larger eDRAM banks.…”
Section: F. 3-Level Cache Hierarchy (mentioning)
confidence: 99%
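The two baseline last-level-cache designs compared in the quoted passage can be summarized as a small configuration table. The dictionary layout and field names below are illustrative; the capacities and the 7-cycle bank latency are the values quoted in the passage (CACTI-derived for the SRAM design, and optimistically assumed equal for the eDRAM banks).

```python
# Baseline LLC configurations from the quoted passage.
configs = {
    "3level-SRAM":  {"llc_type": "SRAM",  "llc_size_mb": 32,
                     "bank_latency_cycles": 7},   # CACTI-derived
    "3level-eDRAM": {"llc_type": "eDRAM", "llc_size_mb": 128,
                     "bank_latency_cycles": 7},   # optimistically assumed equal
}

for name, cfg in configs.items():
    print(f"{name}: {cfg['llc_size_mb']} MB {cfg['llc_type']} NUCA LLC, "
          f"{cfg['bank_latency_cycles']}-cycle bank access")
```

Holding latency equal across both designs isolates the effect of the 4x capacity difference, which is the comparison the citing paper is after.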