Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme

Pai, Sreepathi; Govindarajan, Raghav; Thazhuthaveetil, Matthew J.

doi:10.1145/2370816.2370824

Cited by 35 publications

(17 citation statements)

References 15 publications

“…Like previous memory management systems for GPUs ([13,17,12,9]), SemCache maintains what amounts to a distributed shared memory (DSM) between the CPU and GPU. SemCache tracks shared data at a variable granularity, as memory ranges.…”

Section: Semcache (mentioning)

confidence: 99%

“…Prior work has used compiler analysis or programmer annotations to determine if the operation is a read or a write [13,12,9,17]. Since SemCache++ focuses on libraries, it can use simple directives inserted into the library code to indicate which matrices are read and written by the GPU, as well as which submatrices are needed by tasks dispatched to various GPUs.…”

Section: Instrumenting GPU Reads and Writes (mentioning)

confidence: 99%

“…In the single-GPU setting, there is substantial prior work on automatic data management [13,12,9,17]. Some approaches rely on compiler-assisted software coherence [13,17], limiting applicability and scalability.…”

Section: Automatic Memory Management (mentioning)

confidence: 99%

“…To help tackle this problem, over the past several years there have been several proposals to introduce automatic memory management between the CPU and single GPU, freeing the programmer from the burden of managing data movement [13,12,9,17,2]; in fact, the newest version of CUDA [15] offers Unified Memory (UM), which dynamically tracks data movement between the CPU and GPU, minimizing communication. As a result of these techniques, library-based offloading is a viable option for leveraging a GPU in a heterogeneous system.…”

Section: Introduction (mentioning)

confidence: 99%

SemCache++

Al-Saber

Kulkarni

2015

Proceedings of the 29th ACM on International Conference on Supercomputing

Offloading computations to multiple GPUs is not an easy task. It requires decomposing data, distributing computations and handling communication manually. GPU drop-in libraries (which require no program rewrite) have made it easy to offload computations to multiple GPUs by hiding this complexity inside library calls. Such encapsulation prevents the reuse of data between successive kernel invocations resulting in redundant communication. This limitation exists in multi-GPU libraries like CUBLASXT. In this paper, we introduce SemCache++, a semantics-aware GPU cache that automatically manages communication between the CPU and multiple GPUs in addition to optimizing communication by eliminating redundant transfers using caching. SemCache++ is used to build the first multi-GPU drop-in replacement library that (a) uses the virtual memory to automatically manage and optimize multi-GPU communication and (b) requires no program rewriting or annotations. Our caching technique is efficient; it uses a two level caching directory to track matrices and sub-matrices. Experimental results show that our system can eliminate redundant communication and deliver performance improvements over multi-GPU libraries like StarPU and CUBLASXT.
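
The two-level directory mentioned in this abstract can be pictured roughly as follows. This is an illustrative sketch only, not the SemCache++ implementation: the class `TileDirectory`, its methods, the tile indices, and the device names are all hypothetical, and a counter stands in for real host-to-device copies. The first level is keyed by matrix, the second by sub-matrix (tile), and each entry records which devices already hold a valid copy.

```python
class TileDirectory:
    """Hypothetical two-level map: matrix id -> tile index -> devices holding a valid copy."""

    def __init__(self):
        self.entries = {}
        self.transfers = 0  # number of simulated copies performed

    def ensure_on(self, matrix_id, tile, device):
        """Make `tile` of `matrix_id` available on `device`; return True if a copy was needed."""
        tiles = self.entries.setdefault(matrix_id, {})
        holders = tiles.setdefault(tile, set())
        if device in holders:
            return False  # already cached there: the redundant transfer is skipped
        holders.add(device)
        self.transfers += 1  # stands in for a real host-to-device copy
        return True

    def invalidate(self, matrix_id):
        """Forget every cached tile of a matrix, e.g. after the CPU modifies it."""
        self.entries.pop(matrix_id, None)


directory = TileDirectory()
directory.ensure_on("A", (0, 0), "gpu0")  # first use: copy
directory.ensure_on("A", (0, 0), "gpu0")  # same tile, same device: no copy
directory.ensure_on("A", (0, 1), "gpu1")  # different tile and device: copy
directory.invalidate("A")                 # CPU-side write makes cached tiles stale
directory.ensure_on("A", (0, 0), "gpu0")  # must be copied again
print(directory.transfers)
```

Tracking at tile granularity means that a CPU-side change to one matrix only invalidates that matrix's entries; a finer scheme could invalidate individual tiles instead of the whole matrix.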

“…Pai et al propose a system that automates CPU-GPU memory management based on a coherence scheme in order to reduce superfluous communication [14]. To do this, when a data item is accessed on one side (CPU or GPU side), it is transferred (from the other side) if it is not locally available or if its local version is stale.…”

Section: Related Work (mentioning)

confidence: 99%
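
The behaviour described in this citation statement (transfer a data item on access only if the local copy is missing or stale) can be sketched as a small state tracker. This is a minimal illustration under stated assumptions, not the scheme from the cited paper: the names `Side`, `CoherentBuffer`, `acquire_read`, and `acquire_write` are hypothetical, and a counter stands in for actual CPU-GPU copies.

```python
from enum import Enum


class Side(Enum):
    CPU = "cpu"
    GPU = "gpu"


class CoherentBuffer:
    """Hypothetical tracker of which side holds a valid copy of one data item."""

    def __init__(self, name):
        self.name = name
        # Assumption: data starts out valid only on the CPU side.
        self.valid = {Side.CPU: True, Side.GPU: False}
        self.transfers = 0  # number of simulated copies performed

    def _other(self, side):
        return Side.GPU if side is Side.CPU else Side.CPU

    def acquire_read(self, side):
        """Before a read on `side`, copy the data over only if the local copy is not valid."""
        if not self.valid[side]:
            self.transfers += 1  # stands in for a real host<->device copy
            self.valid[side] = True

    def acquire_write(self, side):
        """Before a write on `side`, validate the local copy and mark the other side stale."""
        self.acquire_read(side)
        self.valid[self._other(side)] = False


buf = CoherentBuffer("a")
buf.acquire_read(Side.GPU)   # first GPU use: one transfer
buf.acquire_read(Side.GPU)   # GPU copy still valid: no transfer
buf.acquire_write(Side.GPU)  # GPU writes: the CPU copy becomes stale
buf.acquire_read(Side.CPU)   # CPU reads again: one transfer back
print(buf.transfers)  # prints 2
```

Four accesses cost two copies here, whereas copying around every kernel launch would cost more; that difference is the superfluous communication the statement refers to.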

BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications

Mokhtari

Stumm

2014

2014 IEEE 28th International Parallel and Distributed Processing Symposium

GPUs offer an order of magnitude higher compute power and memory bandwidth than CPUs. GPUs therefore might appear to be well suited to accelerate computations that operate on voluminous data sets in independent ways; e.g., for transformations, filtering, aggregation, partitioning or other "Big Data" style processing. Yet experience indicates that it is difficult, and often error-prone, to write GPGPU programs which efficiently process data that does not fit in GPU memory, partly because of the intricacies of GPU hardware architecture and programming models, and partly because of the limited bandwidth available between GPUs and CPUs. In this paper, we propose BigKernel, a scheme that provides pseudo-virtual memory to GPU applications and is implemented using a 4-stage pipeline with automated prefetching to (i) optimize CPU-GPU communication and (ii) optimize GPU memory accesses. BigKernel simplifies the programming model by allowing programmers to write kernels using arbitrarily large data structures that can be partitioned into segments where each segment is operated on independently; these kernels are transformed into BigKernel using straight-forward compiler transformations. Our evaluation on six data-intensive benchmarks shows that BigKernel achieves an average speedup of 1.7 over state-of-the-art double-buffering techniques and an average speedup of 3.0 over corresponding multi-threaded CPU implementations.
