2010
DOI: 10.1007/s10766-010-0142-5

A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations

Abstract: Iterative stencil loops (ISLs) are used in many applications, and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially …
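
The halo (ghost zone) idea in the abstract can be illustrated with a small CPU-side sketch. It is not the authors' GPU implementation: the 1-D three-point averaging stencil, the function names (`step`, `reference`, `tiled_with_ghost_zones`), and the tile and halo sizes are all assumptions chosen for illustration. Each tile copies `k` extra cells from each neighbour, runs `k` time steps without communicating, and keeps only its own interior, so halos are exchanged once every `k` iterations at the cost of redundant work in the overlap.

```python
import numpy as np


def step(u):
    """One 3-point averaging sweep; the two end cells are held fixed."""
    v = u.copy()
    v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
    return v


def reference(u, steps):
    """Untiled baseline: sweep the whole array `steps` times."""
    for _ in range(steps):
        u = step(u)
    return u


def tiled_with_ghost_zones(u, steps, tile, halo):
    """Advance `steps` iterations, refreshing halos only every `halo` steps.

    Hypothetical helper for illustration; `tile` is the tile width and
    `halo` is the ghost-zone width on each side of a tile.
    """
    n = len(u)
    done = 0
    while done < steps:
        k = min(halo, steps - done)          # steps to run before the next exchange
        out = np.empty_like(u)
        for start in range(0, n, tile):
            end = min(start + tile, n)
            lo = max(start - k, 0)           # extend the tile by k ghost cells,
            hi = min(end + k, n)             # clipped at the global boundary
            local = u[lo:hi].copy()          # the "halo exchange": one bulk copy
            for _ in range(k):
                local = step(local)          # ghost cells go stale one layer per step
            # Only the tile's own cells are still valid after k steps.
            out[start:end] = local[start - lo:start - lo + (end - start)]
        u = out
        done += k
    return u


if __name__ == "__main__":
    u0 = np.sin(np.linspace(0.0, 3.0, 64))
    a = reference(u0, 10)
    b = tiled_with_ghost_zones(u0, 10, tile=16, halo=3)
    print(np.allclose(a, b))
```

A wider halo means fewer exchanges and synchronizations but more redundant computation in the overlapping regions, which is the trade-off a ghost-zone performance study has to weigh.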

Citations: cited by 40 publications (33 citation statements)
References: 33 publications

“…Recent work has shown promise for high performance by use of overlapped tiling on GPUs [11] for stencil computations. In this paper, we present an automated approach to generate efficient overlapped tiling code for stencil computations on GPUs.…”
Section: Stencil Computations (mentioning)
confidence: 99%
“…Our equivalent Jacobi 2-D stencil achieves 49.5 GFlop/s in double-precision mode on the GTX 580. Meng et al. [11] report approximately 2 × 10^6 cycles per iteration on a GTX 280 for a Poisson stencil that has been manually tiled using overlapped tiling with a time tile size of 3. With a clock speed of 1.3 GHz, this gives approximately 70.2 GFlop/s.…”
Section: Performance Analysis (mentioning)
confidence: 99%
“…In the literature, the problem of designing efficient implementations for this class of algorithms has been addressed for both CPUs ([5], [6]) and GPGPUs ([7], [8]): on such architectures, the main problems that have been faced are the memory organization and the data transfers.…”
Section: State-of-the-art Implementations (mentioning)
confidence: 99%