Proceedings of the 24th ACM International Conference on Supercomputing 2010
DOI: 10.1145/1810085.1810096

Cache oblivious parallelograms in iterative stencil computations

Abstract: We present a new cache oblivious scheme for iterative stencil computations that performs beyond system bandwidth limitations as though gigabytes of data could reside in an enormous on-chip cache. We compare execution times for 2D and 3D spatial domains with up to 128 million double precision elements for constant and variable stencils against hand-optimized naive code and the automatic polyhedral parallelizer and locality optimizer PluTo and demonstrate the clear superiority of our results. The performance bene…

Cited by 45 publications (37 citation statements)
References 18 publications (18 reference statements)
“…Strzodka et al [17] used time skewing and cache-size oblivious parallelograms to improve the memory system pressure and parallelism in stencils on CPUs. Micikevicius et al [15] handtuned a 3-D finite difference computation stencil and achieved an order of magnitude performance increase over existing CPU implementations on GT200-based Tesla GPUs.…”
Section: Related Work (mentioning)
confidence: 99%
“…A number of recent studies have focused on optimizing stencil computations for multicore CPUs and GPUs [2,6,8,12,17,19,20]. Strzodka et al [17] use time skewing and cache-size oblivious parallelograms to improve the memory system pressure and parallelism in stencils on CPUs.…”
Section: Related Work (mentioning)
confidence: 99%
“…A number of recent studies have focused on optimizing stencil computations on multicore CPUs [2,6,8,17,19] as well as GPUs [11][12][13].…”
Section: Introduction (mentioning)
confidence: 99%
“…Song and Li proposed similar techniques in [4]. [3] proposed a cache-oblivious algorithm to compute stencil computations for the same purpose of temporal reuse. [6] proposed circular queue, which uses a special data structure to hold temporary values.…”
Section: Related Work (mentioning)
confidence: 99%
“…Otherwise, memory bandwidth may become a serious performance bottleneck. Thus, a lot of prior work exploits this kind of reuse across multiple time steps [1][2][3][4][5].…”
Section: Introductionmentioning
confidence: 99%
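
Several of the statements above refer to time skewing and to reuse across multiple time steps. The following is a minimal sketch of that general idea, not the cache oblivious parallelogram scheme of the cited paper. It assumes a 1D three-point averaging stencil with fixed boundary values, two ping-pong buffers, and a fixed tile width; the function names `step_naive` and `step_time_skewed` and the `tile` parameter are hypothetical and chosen only for illustration.

```python
def step_naive(u, steps):
    """Reference: sweep the whole array once per time step."""
    cur = list(u)
    for _ in range(steps):
        nxt = list(cur)  # boundary values are carried over unchanged
        for x in range(1, len(cur) - 1):
            nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0
        cur = nxt
    return cur


def step_time_skewed(u, steps, tile):
    """Time-skewed sweep: each tile advances through all time steps.

    Tiles are parallelograms in space-time. In the skewed coordinate
    xs = x + s, step s at point x only reads values produced at
    skewed coordinates <= xs, so tiles can run left to right.
    """
    n = len(u)
    buf = [list(u), list(u)]          # ping-pong buffers (assumed layout)
    last = (n - 2) + (steps - 1)      # largest skewed coordinate in use
    for a in range(1, last + 1, tile):
        b = min(a + tile, last + 1)   # tile covers skewed range [a, b)
        for s in range(steps):
            src, dst = buf[s % 2], buf[(s + 1) % 2]
            lo = max(1, a - s)        # tile shifts left by one per step
            hi = min(n - 2, b - 1 - s)
            for x in range(lo, hi + 1):
                dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0
    return buf[steps % 2]


if __name__ == "__main__":
    data = [float(i * i % 7) for i in range(20)]
    assert step_time_skewed(data, 5, 4) == step_naive(data, 5)
    print("time-skewed result matches the naive sweep")
```

Both functions perform the same arithmetic on the same inputs, only in a different order, so their results agree exactly. The skewed version touches a window of roughly `tile + steps` elements per tile, which is where the reuse across time steps comes from when that window fits in cache.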