2013
DOI: 10.1007/978-3-642-36036-7_16
Static Compilation Analysis for Host-Accelerator Communication Optimization

Abstract: We present an automatic, static program transformation that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible. We implemented this transformation as a middle-end compilation pass in the PIPS/Par4All compiler. In the generate…
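The two heuristics are easiest to see on a small example. The sketch below is hand-written for illustration and is not Par4All output; the kernel, array size, and iteration count are all assumptions. It contrasts a naive schedule, which transfers the array on every loop iteration, with the transformed schedule, where the upload is hoisted as early as possible (before the loop) and the download is delayed as late as possible (after it).

    /* Hypothetical CUDA C illustration of the two scheduling heuristics. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)
    #define ITERS 100

    __global__ void scale(float *a, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) a[i] *= factor;
    }

    int main(void) {
        static float h_a[N];
        for (int i = 0; i < N; ++i) h_a[i] = 1.0f;

        float *d_a;
        cudaMalloc((void **)&d_a, N * sizeof(float));

        /* Naive schedule: one upload and one download per iteration.
         *   for (int t = 0; t < ITERS; ++t) {
         *       cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
         *       scale<<<(N + 255) / 256, 256>>>(d_a, 1.0001f);
         *       cudaMemcpy(h_a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
         *   }
         */

        /* Transformed schedule following the paper's two heuristics:
         * upload as early as possible (before the loop), download as late
         * as possible (after the loop), since the host never reads h_a
         * between iterations. */
        cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
        for (int t = 0; t < ITERS; ++t)
            scale<<<(N + 255) / 256, 256>>>(d_a, 1.0001f);
        cudaMemcpy(h_a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);

        printf("a[0] = %f\n", h_a[0]);
        cudaFree(d_a);
        return 0;
    }

Because the host never touches h_a between iterations, the transformed schedule performs two transfers in total instead of two per iteration.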

Cited by 19 publications (17 citation statements)
References 22 publications
“…It creates opportunities for such overlap by transforming the computation into multiple chunks and transferring the data for chunk '(i+1)' while executing chunk 'i'. Our technique contrasts with those that reduce data transfer overhead by eliminating redundant memory transfers [1,2] and by advancing or delaying the data copy operations [3]. Second, the pipelining technique builds on an enabling technique that deals with another important issue in accelerators: The accelerator's memory space is limited; computation that fits in the CPU's memory may exceed the accelerator's capacity.…”
Section: Introduction
confidence: 94%
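The chunked overlap described in this excerpt can be sketched with double buffering over two CUDA streams; the kernel, chunk count, and buffer names below are assumptions made for illustration, not the cited authors' implementation. While chunk i runs in one stream, the copy for the next chunk proceeds in the other.

    /* Hypothetical double-buffering sketch: the input is split into CHUNKS
     * pieces; asynchronous copies in one stream overlap the kernel running
     * on the previous chunk in the other stream. */
    #include <cuda_runtime.h>

    #define N (1 << 22)
    #define CHUNKS 8
    #define CHUNK (N / CHUNKS)

    __global__ void square(float *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= buf[i];
    }

    int main(void) {
        float *h, *d;
        cudaMallocHost((void **)&h, N * sizeof(float));  /* pinned memory for async copies */
        cudaMalloc((void **)&d, N * sizeof(float));
        for (int i = 0; i < N; ++i) h[i] = (float)i;

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int c = 0; c < CHUNKS; ++c) {
            cudaStream_t st = s[c % 2];
            size_t off = (size_t)c * CHUNK;
            /* Upload chunk c, run it, download it, all in its own stream;
             * the next chunk's upload in the other stream overlaps this
             * chunk's kernel. */
            cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, st);
            square<<<(CHUNK + 255) / 256, 256, 0, st>>>(d + off, CHUNK);
            cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, st);
        }
        cudaDeviceSynchronize();

        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }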
“…It relies on pragmas, but it does not ease the programmer's task as much as the Cray, CAPS or PGI compilers. Other experimental compilation tools like CGCM [16] and PAR4ALL [1] aim at automating the process of CPU-GPU communication and the detection of the pieces of code that can run in parallel. The work by Lee and Eigenmann [20] proposes OpenMPC, an API to facilitate translation of OpenMP programs to CUDA, and a compilation system to support it.…”
Section: Related Work
confidence: 99%
“…4 As the write regions are empty for src, this corresponds to the loads. The SCALOPES project coupled an asymmetric MP-SoC, with cores dedicated to task scheduling, to a semi-automatic parallelization workflow.…”
Section: Applications
confidence: 99%
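The remark about empty write regions can be illustrated with a minimal, assumed kernel: an array that is only read has an empty write region and therefore needs only a host-to-device copy (a load), while a written array must also be copied back.

    /* Assumed example: src is read-only in the kernel, dst is written. */
    #include <cuda_runtime.h>

    #define N 1024

    __global__ void copy_scaled(float *dst, const float *src) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) dst[i] = 2.0f * src[i];
    }

    int main(void) {
        float h_src[N], h_dst[N];
        for (int i = 0; i < N; ++i) h_src[i] = (float)i;

        float *d_src, *d_dst;
        cudaMalloc((void **)&d_src, N * sizeof(float));
        cudaMalloc((void **)&d_dst, N * sizeof(float));

        /* Load only: src's write region is empty, so no copy back is needed. */
        cudaMemcpy(d_src, h_src, N * sizeof(float), cudaMemcpyHostToDevice);

        copy_scaled<<<(N + 255) / 256, 256>>>(d_dst, d_src);

        /* Store: dst's write region covers dst[0..N-1], so it is copied back. */
        cudaMemcpy(h_dst, d_dst, N * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_src);
        cudaFree(d_dst);
        return 0;
    }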
“…The second point has been addressed using simplified input from the programmer [13,27,19], or automatically [4,24,1,26] using compilers. This paper exposes how the array regions abstraction [11] can be used by a compiler to automatically compute memory transfers in the presence of complex code patterns.…”
Section: Introduction
confidence: 99%
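As a rough illustration of what regions-driven transfer generation buys on a less trivial access pattern (the bounds and names below are invented for the example): when a kernel touches only a convex sub-array, only that region has to be moved in either direction, rather than the whole array.

    /* Assumed example: only a[first .. first+len-1] is read and written,
     * so only that convex region is transferred. */
    #include <cuda_runtime.h>

    #define N 4096

    __global__ void bump(float *a_region, int len) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len) a_region[i] += 1.0f;
    }

    int main(void) {
        static float h_a[N] = {0};
        int first = 1024, len = 512;   /* region actually accessed */

        float *d_region;
        cudaMalloc((void **)&d_region, len * sizeof(float));

        /* Transfer only the accessed region a[first .. first+len-1]. */
        cudaMemcpy(d_region, h_a + first, len * sizeof(float),
                   cudaMemcpyHostToDevice);
        bump<<<(len + 255) / 256, 256>>>(d_region, len);
        cudaMemcpy(h_a + first, d_region, len * sizeof(float),
                   cudaMemcpyDeviceToHost);

        cudaFree(d_region);
        return 0;
    }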