2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2015.7056046
Unlocking bandwidth for GPUs in CC-NUMA systems

Cited by 68 publications (28 citation statements) | References 29 publications
“…If the GPU's stacked DRAMs are placed at the same level as the CPU memory, then a flat non-uniform memory access (flat-NUMA) organization can be created to leverage the high bandwidth of the GPU memory, which will reduce bandwidth contention on the CPU memory [Bolotin et al 2015]. But coherency can only be maintained with intelligent data migration [Agarwal et al 2015]. Migrating data between memory units of multiple CPU and GPU devices requires more complex software and will introduce software-based memory management overhead to the system.…”
Section: Related Work (mentioning, confidence: 99%)
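To make the migration point concrete, here is a minimal sketch of software-directed migration in a unified CPU-GPU address space, using CUDA Unified Memory (cudaMallocManaged, cudaMemAdvise, cudaMemPrefetchAsync) as a stand-in for the mechanisms the cited works discuss. The saxpy kernel and buffer size are illustrative assumptions, not details from the cited papers.

```cuda
// Sketch (assumed kernel and sizes): migrate a shared buffer to GPU memory
// before a kernel runs, then back to CPU memory before the host reads it.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    // Managed allocations are visible to both CPU and GPU; the runtime migrates
    // pages between the two memories, a software-managed analogue of the
    // "intelligent data migration" the excerpt refers to.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }  // pages populated in CPU memory

    int gpu = 0;
    // Hint that the GPU is the preferred home, then migrate pages ahead of the
    // kernel so it reads from high-bandwidth GPU DRAM rather than host memory.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, gpu);
    cudaMemAdvise(y, n * sizeof(float), cudaMemAdviseSetPreferredLocation, gpu);
    cudaMemPrefetchAsync(x, n * sizeof(float), gpu);
    cudaMemPrefetchAsync(y, n * sizeof(float), gpu);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);

    // Migrate the result back before the host touches it.
    cudaMemPrefetchAsync(y, n * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```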
“…In zero-copy, accesses are made to the host memory. This is while today's GPUs have a large memory space and can provide bandwidth that is much higher than the bandwidth present on host memory [Agarwal et al 2015] (i.e., High-Bandwidth Memory (HBM) interface, targeted for GPUs, provides more than 100GB/s bandwidth per Dynamic Random-Access Memory (DRAM) stack, with multiple stacks integrated per chip [AMD 2015b]). Another drawback of zero-copy is the use of pinned pages to keep the data used by the GPU in the host memory until the end of the GPU kernel execution.…”
Section: Introduction (mentioning, confidence: 99%)
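As a rough illustration of the zero-copy drawback described in this excerpt, the sketch below contrasts a kernel that dereferences pinned, mapped host memory (every access crosses the CPU-GPU interconnect to host DRAM, and the pages remain pinned for the kernel's lifetime) with one that first stages the data in device memory. The scale kernel and buffer size are assumptions for illustration.

```cuda
// Sketch (assumed kernel and sizes): zero-copy access to pinned host memory
// versus explicit staging in device (HBM) memory.
#include <cuda_runtime.h>

__global__ void scale(int n, float s, float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapping pinned host memory into the GPU

    // Path 1: zero-copy. The buffer is page-locked in host DRAM and mapped into
    // the GPU's address space; the kernel's loads and stores travel over the
    // interconnect, so it is limited by host-memory and link bandwidth.
    float *h_mapped, *d_alias;
    cudaHostAlloc(&h_mapped, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_alias, h_mapped, 0);
    for (int i = 0; i < n; ++i) h_mapped[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(n, 2.0f, d_alias);
    cudaDeviceSynchronize();

    // Path 2: explicit staging. After the copy, the kernel runs against device
    // memory at full GPU bandwidth, at the cost of the copy and extra footprint.
    float *d_buf;
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, h_mapped, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(n, 2.0f, d_buf);
    cudaMemcpy(h_mapped, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_mapped);
    return 0;
}
```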
“…Other techniques improve application performance on GPUs through addressing the problems of data transfer [58,59], thread divergence [60], data placement [61], synchronization overhead [62] and configuration tuning [63,64]. GPU resource sharing has been studied at both system [65,66] and architecture levels [67,68] to address the resource contention and performance interference.…”
Section: Scheduling On Accelerator (mentioning, confidence: 99%)
“…From these components, the total latency of a VC is simply the sum of VC access latency (access rate × network and bank latency) and memory latency (miss rate × miss penalty). The runtime builds the total latency curves for each VC and uses them to partition cache capacity. Traditional cache partitioning schemes try to minimize cache misses and partition using miss rate curves [52,55].…”
Section: Jigsaw: Our Baseline System (mentioning, confidence: 99%)
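Written out, the total-latency expression quoted above is, in our own notation (r_acc the VC access rate, L_net and L_bank the network and bank latencies, r_miss the miss rate, L_miss the miss penalty):

```latex
\mathrm{Latency}_{\mathrm{total}} \;=\;
\underbrace{r_{\mathrm{acc}} \cdot \left(L_{\mathrm{net}} + L_{\mathrm{bank}}\right)}_{\text{VC access latency}}
\;+\;
\underbrace{r_{\mathrm{miss}} \cdot L_{\mathrm{miss}}}_{\text{memory latency}}
```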
“…Instead, Whirlpool relies on distinguishing among the few main classes of data in the program, which makes online adaptation inexpensive. Recent work has studied page placement for systems with heterogeneous and non-uniform main memory (NUMA) [1,23,69]. NUMA techniques also have different goals and constraints than NUCA: main memory is larger and has significantly lower bandwidth, so these designs primarily seek to balance bandwidth over network distance and capacity, and reconfigurations are much more infrequent.…”
Section: Additional Related Work (mentioning, confidence: 99%)
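One concrete reading of the bandwidth-balancing goal mentioned here is bandwidth-ratio page interleaving: pages are spread across the GPU and CPU memories in proportion to their bandwidths so that, under uniform access, both memories contribute their full bandwidth. The sketch below is a minimal illustration of that idea; the 5:2 share (roughly 200 GB/s of stacked GPU DRAM versus 80 GB/s of CPU DDR) is an assumed example, not a figure from the cited works.

```cuda
// Sketch (assumed bandwidths and page counts): interleave page homes between
// GPU and CPU memory in proportion to their bandwidths.
#include <cstdio>

enum class Home { GpuMem, CpuMem };

// Out of every (gpuShare + cpuShare) consecutive pages, send gpuShare to GPU
// memory and cpuShare to CPU memory.
Home placePage(unsigned pageIdx, unsigned gpuShare, unsigned cpuShare) {
    unsigned period = gpuShare + cpuShare;
    return (pageIdx % period) < gpuShare ? Home::GpuMem : Home::CpuMem;
}

int main() {
    // Assumed ratio: ~200 GB/s GPU DRAM : ~80 GB/s CPU DDR = 5 : 2.
    const unsigned gpuShare = 5, cpuShare = 2;
    unsigned gpuPages = 0, cpuPages = 0;
    for (unsigned p = 0; p < 1024; ++p) {
        if (placePage(p, gpuShare, cpuShare) == Home::GpuMem) ++gpuPages;
        else ++cpuPages;
    }
    printf("GPU-resident pages: %u, CPU-resident pages: %u\n", gpuPages, cpuPages);
    return 0;
}
```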