2013
DOI: 10.1007/978-3-642-36036-7_16
Static Compilation Analysis for Host-Accelerator Communication Optimization

Abstract: We present an automatic, static program transformation that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible. We implemented this transformation as a middle-end compilation pass in the PIPS/Par4All compiler. In the generate…
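The two heuristics are easiest to see on a small example. The sketch below is hand-written for illustration and is not Par4All output; the kernel, array size, and iteration count are all assumptions. It contrasts a naive schedule, which transfers the array on every loop iteration, with the transformed schedule, where the upload is hoisted as early as possible (before the loop) and the download is delayed as late as possible (after it).

    /* Hypothetical CUDA C illustration of the two scheduling heuristics. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)
    #define ITERS 100

    __global__ void scale(float *a, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) a[i] *= factor;
    }

    int main(void) {
        static float h_a[N];
        for (int i = 0; i < N; ++i) h_a[i] = 1.0f;

        float *d_a;
        cudaMalloc((void **)&d_a, N * sizeof(float));

        /* Naive schedule: one upload and one download per iteration.
         *   for (int t = 0; t < ITERS; ++t) {
         *       cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
         *       scale<<<(N + 255) / 256, 256>>>(d_a, 1.0001f);
         *       cudaMemcpy(h_a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
         *   }
         */

        /* Transformed schedule following the paper's two heuristics:
         * upload as early as possible (before the loop), download as late
         * as possible (after the loop), since the host never reads h_a
         * between iterations. */
        cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
        for (int t = 0; t < ITERS; ++t)
            scale<<<(N + 255) / 256, 256>>>(d_a, 1.0001f);
        cudaMemcpy(h_a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);

        printf("a[0] = %f\n", h_a[0]);
        cudaFree(d_a);
        return 0;
    }

Because the host never touches h_a between iterations, the transformed schedule performs two transfers in total instead of two per iteration.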

Cited by 19 publications (17 citation statements)
References 22 publications
“…It creates opportunities for such overlap by transforming the computation into multiple chunks and transferring the data for chunk '(i+1)' while executing chunk 'i'. Our technique contrasts with those that reduce data transfer overhead by eliminating redundant memory transfers [1,2] and by advancing or delaying the data copy operations [3]. Second, the pipelining technique builds on an enabling technique that deals with another important issue in accelerators: The accelerator's memory space is limited; computation that fits in the CPU's memory may exceed the accelerator's capacity.…”
Section: Introduction
confidence: 94%
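The chunked overlap described in this excerpt can be sketched with double buffering over two CUDA streams; the kernel, chunk count, and buffer names below are assumptions made for illustration, not the cited authors' implementation. While chunk i runs in one stream, the copy for the next chunk proceeds in the other.

    /* Hypothetical double-buffering sketch: the input is split into CHUNKS
     * pieces; asynchronous copies in one stream overlap the kernel running
     * on the previous chunk in the other stream. */
    #include <cuda_runtime.h>

    #define N (1 << 22)
    #define CHUNKS 8
    #define CHUNK (N / CHUNKS)

    __global__ void square(float *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= buf[i];
    }

    int main(void) {
        float *h, *d;
        cudaMallocHost((void **)&h, N * sizeof(float));  /* pinned memory for async copies */
        cudaMalloc((void **)&d, N * sizeof(float));
        for (int i = 0; i < N; ++i) h[i] = (float)i;

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int c = 0; c < CHUNKS; ++c) {
            cudaStream_t st = s[c % 2];
            size_t off = (size_t)c * CHUNK;
            /* Upload chunk c, run it, download it, all in its own stream;
             * the next chunk's upload in the other stream overlaps this
             * chunk's kernel. */
            cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, st);
            square<<<(CHUNK + 255) / 256, 256, 0, st>>>(d + off, CHUNK);
            cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, st);
        }
        cudaDeviceSynchronize();

        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }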
“…It relies on pragmas, but it does not ease the programmer's task as much as the Cray, CAPS or PGI compilers. Other experimental compilation tools like CGCM [16] and PAR4ALL [1] aim at automating the process of CPU-GPU communication and the detection of the pieces of code that can run in parallel. The work by Lee and Eigenmann [20] proposes OpenMPC, an API to facilitate translation of OpenMP programs to CUDA, and a compilation system to support it.…”
Section: Related Work
confidence: 99%
“…4 As the write regions are empty for src, this corresponds to the loads. The SCALOPES project coupled an asymmetric MP-SoC, with cores dedicated to task scheduling, to a semi-automatic parallelization workflow.…”
Section: Applications
confidence: 99%
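The remark about empty write regions can be illustrated with a minimal, assumed kernel: an array that is only read has an empty write region and therefore needs only a host-to-device copy (a load), while a written array must also be copied back.

    /* Assumed example: src is read-only in the kernel, dst is written. */
    #include <cuda_runtime.h>

    #define N 1024

    __global__ void copy_scaled(float *dst, const float *src) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) dst[i] = 2.0f * src[i];
    }

    int main(void) {
        float h_src[N], h_dst[N];
        for (int i = 0; i < N; ++i) h_src[i] = (float)i;

        float *d_src, *d_dst;
        cudaMalloc((void **)&d_src, N * sizeof(float));
        cudaMalloc((void **)&d_dst, N * sizeof(float));

        /* Load only: src's write region is empty, so no copy back is needed. */
        cudaMemcpy(d_src, h_src, N * sizeof(float), cudaMemcpyHostToDevice);

        copy_scaled<<<(N + 255) / 256, 256>>>(d_dst, d_src);

        /* Store: dst's write region covers dst[0..N-1], so it is copied back. */
        cudaMemcpy(h_dst, d_dst, N * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_src);
        cudaFree(d_dst);
        return 0;
    }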
“…The second point has been addressed using simplified input from the programmer [13,27,19], or automatically [4,24,1,26] using compilers. This paper exposes how the array regions abstraction [11] can be used by a compiler to automatically compute memory transfers in the presence of complex code patterns.…”
Section: Introduction
confidence: 99%
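As a rough illustration of what regions-driven transfer generation buys on a less trivial access pattern (the bounds and names below are invented for the example): when a kernel touches only a convex sub-array, only that region has to be moved in either direction, rather than the whole array.

    /* Assumed example: only a[first .. first+len-1] is read and written,
     * so only that convex region is transferred. */
    #include <cuda_runtime.h>

    #define N 4096

    __global__ void bump(float *a_region, int len) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len) a_region[i] += 1.0f;
    }

    int main(void) {
        static float h_a[N] = {0};
        int first = 1024, len = 512;   /* region actually accessed */

        float *d_region;
        cudaMalloc((void **)&d_region, len * sizeof(float));

        /* Transfer only the accessed region a[first .. first+len-1]. */
        cudaMemcpy(d_region, h_a + first, len * sizeof(float),
                   cudaMemcpyHostToDevice);
        bump<<<(len + 255) / 256, 256>>>(d_region, len);
        cudaMemcpy(h_a + first, d_region, len * sizeof(float),
                   cudaMemcpyDeviceToHost);

        cudaFree(d_region);
        return 0;
    }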