2017
DOI: 10.1109/mm.2017.40
IBM Power9 Processor Architecture

Cited by 86 publications (35 citation statements). References: 0 publications.
“…GPUs are designed to improve system throughput, and their design is aimed at exploiting Thread Level Parallelism (TLP) by supporting the concurrent execution of a vast number of threads. The number of threads that a GPU can simultaneously execute exceeds, by several orders of magnitude, the number of hardware contexts supported by advanced processors like the IBM Power9 [32] or the Intel Knights Landing [33]. This feature is especially important for the execution of parallel scientific applications that rely on a high number of threads.…”
Section: Introduction (mentioning)
confidence: 99%
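The "several orders of magnitude" gap in the quoted passage can be made concrete with back-of-the-envelope arithmetic. The Power9 figure below is the 24-core SMT4 configuration; the GPU figures (80 SMs, 2,048 resident threads per SM) are illustrative assumptions for a typical datacenter GPU, not numbers from the cited papers.

```python
import math

# CPU side: IBM Power9, 24-core SMT4 variant.
power9_threads = 24 * 4             # 96 hardware thread contexts

# GPU side: assumed example device.
gpu_sms = 80                        # assumed streaming multiprocessor count
gpu_threads_per_sm = 2048           # assumed max resident threads per SM
gpu_threads = gpu_sms * gpu_threads_per_sm

# Express the gap as a power of ten.
gap = math.log10(gpu_threads / power9_threads)
print(f"GPU resident threads: {gpu_threads}, "
      f"Power9 hardware contexts: {power9_threads}, "
      f"ratio ~ 10^{gap:.1f}")
```

Under these assumptions the ratio is roughly three orders of magnitude, consistent with the claim in the citing paper.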
“…During runtime, LT will separately fetch the mask bits along with the instructions in the I-cache. 1 Runtime Operations: With this architectural support, we now briefly describe the overall operation of the baseline system in DLA mode (Fig. 2).…”
Section: Overview of DLA Baseline (mentioning)
confidence: 99%
“…Similarly, our baseline DLA can be easily implemented on an SMT substrate. Indeed, if we already have the ability to fuse two cores into a wider one that supports SMT [1], we can build a straightforward DLA-on-SMT system, which simply fetches instructions from both threads in a round-robin fashion and gives no priority to either thread. This system is in general faster than running DLA on two cores, largely because sharing resources improves utilization.…”
Section: Adaptive Resource Allocation (mentioning)
confidence: 99%
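The fetch policy described in the quoted passage — pulling instructions from both threads in round-robin order with no priority — can be sketched in a few lines. This is a hypothetical simplification of the DLA-on-SMT front end, not the authors' implementation; the function name and stream representation are illustrative.

```python
from collections import deque

def round_robin_fetch(thread_a, thread_b):
    """Interleave instructions from two thread streams, one per
    thread per round, giving no priority to either thread.
    When one stream runs dry, drain the other."""
    a, b = deque(thread_a), deque(thread_b)
    fetched = []
    while a or b:
        if a:
            fetched.append(a.popleft())
        if b:
            fetched.append(b.popleft())
    return fetched

print(round_robin_fetch(["A0", "A1", "A2"], ["B0", "B1"]))
# → ['A0', 'B0', 'A1', 'B1', 'A2']
```

The interleaving shows why sharing one wide front end can beat two narrow cores: whenever one thread stalls or runs out of work, the other immediately absorbs the idle fetch slots.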
“…hierarchy, adding a 512KB private second-level SRAM cache to all configurations. We consider two baseline designs: (1) an Intel-like design featuring a 32MB SRAM-based NUCA LLC, referred to as 3level-SRAM, and (2) a 128MB eDRAM-based NUCA LLC similar to the POWER9 [40], referred to as 3level-eDRAM. Using CACTI, we find the bank access latency in the SRAM design to be 7 cycles, and optimistically assume the same access latency for the larger eDRAM banks.…”
Section: F. 3-Level Cache Hierarchy (mentioning)
confidence: 99%
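The two baseline last-level-cache designs compared in the quoted passage can be summarized as a small configuration table. The dictionary layout and field names below are illustrative; the capacities and the 7-cycle bank latency are the values quoted in the passage (CACTI-derived for the SRAM design, and optimistically assumed equal for the eDRAM banks).

```python
# Baseline LLC configurations from the quoted passage.
configs = {
    "3level-SRAM":  {"llc_type": "SRAM",  "llc_size_mb": 32,
                     "bank_latency_cycles": 7},   # CACTI-derived
    "3level-eDRAM": {"llc_type": "eDRAM", "llc_size_mb": 128,
                     "bank_latency_cycles": 7},   # optimistically assumed equal
}

for name, cfg in configs.items():
    print(f"{name}: {cfg['llc_size_mb']} MB {cfg['llc_type']} NUCA LLC, "
          f"{cfg['bank_latency_cycles']}-cycle bank access")
```

Holding latency equal across both designs isolates the effect of the 4x capacity difference, which is the comparison the citing paper is after.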