2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2015.7056046
Unlocking bandwidth for GPUs in CC-NUMA systems

Cited by 68 publications (28 citation statements) | References 29 publications
“…If the GPU's stacked DRAMs are placed at the same level as the CPU memory, then a flat non-uniform memory access (flat-NUMA) organization can be created to leverage the high bandwidth of the GPU memory, which will reduce bandwidth contention on the CPU memory [Bolotin et al 2015]. But coherency can only be maintained with intelligent data migration [Agarwal et al 2015]. Migrating data between memory units of multiple CPU and GPU devices requires more complex software and will introduce software-based memory management overhead to the system.…”
Section: Related Work (mentioning, confidence: 99%)
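To make the migration point concrete, here is a minimal sketch of software-directed migration in a unified CPU-GPU address space, using CUDA Unified Memory (cudaMallocManaged, cudaMemAdvise, cudaMemPrefetchAsync) as a stand-in for the mechanisms the cited works discuss. The saxpy kernel and buffer size are illustrative assumptions, not details from the cited papers.

```cuda
// Sketch (assumed kernel and sizes): migrate a shared buffer to GPU memory
// before a kernel runs, then back to CPU memory before the host reads it.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    // Managed allocations are visible to both CPU and GPU; the runtime migrates
    // pages between the two memories, a software-managed analogue of the
    // "intelligent data migration" the excerpt refers to.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }  // pages populated in CPU memory

    int gpu = 0;
    // Hint that the GPU is the preferred home, then migrate pages ahead of the
    // kernel so it reads from high-bandwidth GPU DRAM rather than host memory.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, gpu);
    cudaMemAdvise(y, n * sizeof(float), cudaMemAdviseSetPreferredLocation, gpu);
    cudaMemPrefetchAsync(x, n * sizeof(float), gpu);
    cudaMemPrefetchAsync(y, n * sizeof(float), gpu);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);

    // Migrate the result back before the host touches it.
    cudaMemPrefetchAsync(y, n * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```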
“…In zero-copy, accesses are made to the host memory. This is while today's GPUs have a large memory space and can provide bandwidth that is much higher than the bandwidth present on host memory [Agarwal et al 2015] (i.e., High-Bandwidth Memory (HBM) interface, targeted for GPUs, provides more than 100GB/s bandwidth per Dynamic Random-Access Memory (DRAM) stack, with multiple stacks integrated per chip [AMD 2015b]). Another drawback of zero-copy is the use of pinned pages to keep the data used by the GPU in the host memory until the end of the GPU kernel execution.…”
Section: Introduction (mentioning, confidence: 99%)
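As a rough illustration of the zero-copy drawback described in this excerpt, the sketch below contrasts a kernel that dereferences pinned, mapped host memory (every access crosses the CPU-GPU interconnect to host DRAM, and the pages remain pinned for the kernel's lifetime) with one that first stages the data in device memory. The scale kernel and buffer size are assumptions for illustration.

```cuda
// Sketch (assumed kernel and sizes): zero-copy access to pinned host memory
// versus explicit staging in device (HBM) memory.
#include <cuda_runtime.h>

__global__ void scale(int n, float s, float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapping pinned host memory into the GPU

    // Path 1: zero-copy. The buffer is page-locked in host DRAM and mapped into
    // the GPU's address space; the kernel's loads and stores travel over the
    // interconnect, so it is limited by host-memory and link bandwidth.
    float *h_mapped, *d_alias;
    cudaHostAlloc(&h_mapped, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_alias, h_mapped, 0);
    for (int i = 0; i < n; ++i) h_mapped[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(n, 2.0f, d_alias);
    cudaDeviceSynchronize();

    // Path 2: explicit staging. After the copy, the kernel runs against device
    // memory at full GPU bandwidth, at the cost of the copy and extra footprint.
    float *d_buf;
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, h_mapped, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(n, 2.0f, d_buf);
    cudaMemcpy(h_mapped, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_mapped);
    return 0;
}
```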
“…Other techniques improve application performance on GPUs through addressing the problems of data transfer [58,59], thread divergence [60], data placement [61], synchronization overhead [62] and configuration tuning [63,64]. GPU resource sharing has been studied at both system [65,66] and architecture levels [67,68] to address the resource contention and performance interference.…”
Section: Scheduling On Accelerator (mentioning, confidence: 99%)
“…From these components, the total latency of a VC is simply the sum of VC access latency (access rate × network and bank latency) and memory latency (miss rate × miss penalty). The runtime builds the total latency curves for each VC and uses them to partition cache capacity. Traditional cache partitioning schemes try to minimize cache misses and partition using miss rate curves [52,55].…”
Section: Jigsaw: Our Baseline System (mentioning, confidence: 99%)
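Written out, the total-latency expression quoted above is, in our own notation (r_acc the VC access rate, L_net and L_bank the network and bank latencies, r_miss the miss rate, L_miss the miss penalty):

```latex
\mathrm{Latency}_{\mathrm{total}} \;=\;
\underbrace{r_{\mathrm{acc}} \cdot \left(L_{\mathrm{net}} + L_{\mathrm{bank}}\right)}_{\text{VC access latency}}
\;+\;
\underbrace{r_{\mathrm{miss}} \cdot L_{\mathrm{miss}}}_{\text{memory latency}}
```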
“…Instead, Whirlpool relies on distinguishing among the few main classes of data in the program, which makes online adaptation inexpensive. Recent work has studied page placement for systems with heterogeneous and non-uniform main memory (NUMA) [1,23,69]. NUMA techniques also have different goals and constraints than NUCA: main memory is larger and has significantly lower bandwidth, so these designs primarily seek to balance bandwidth over network distance and capacity, and reconfigurations are much more infrequent.…”
Section: Additional Related Work (mentioning, confidence: 99%)
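One concrete reading of the bandwidth-balancing goal mentioned here is bandwidth-ratio page interleaving: pages are spread across the GPU and CPU memories in proportion to their bandwidths so that, under uniform access, both memories contribute their full bandwidth. The sketch below is a minimal illustration of that idea; the 5:2 share (roughly 200 GB/s of stacked GPU DRAM versus 80 GB/s of CPU DDR) is an assumed example, not a figure from the cited works.

```cuda
// Sketch (assumed bandwidths and page counts): interleave page homes between
// GPU and CPU memory in proportion to their bandwidths.
#include <cstdio>

enum class Home { GpuMem, CpuMem };

// Out of every (gpuShare + cpuShare) consecutive pages, send gpuShare to GPU
// memory and cpuShare to CPU memory.
Home placePage(unsigned pageIdx, unsigned gpuShare, unsigned cpuShare) {
    unsigned period = gpuShare + cpuShare;
    return (pageIdx % period) < gpuShare ? Home::GpuMem : Home::CpuMem;
}

int main() {
    // Assumed ratio: ~200 GB/s GPU DRAM : ~80 GB/s CPU DDR = 5 : 2.
    const unsigned gpuShare = 5, cpuShare = 2;
    unsigned gpuPages = 0, cpuPages = 0;
    for (unsigned p = 0; p < 1024; ++p) {
        if (placePage(p, gpuShare, cpuShare) == Home::GpuMem) ++gpuPages;
        else ++cpuPages;
    }
    printf("GPU-resident pages: %u, CPU-resident pages: %u\n", gpuPages, cpuPages);
    return 0;
}
```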