Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization 2020
DOI: 10.1145/3368826.3377904

AN5D: automated stencil framework for high-degree temporal blocking on GPUs

Abstract: Stencil computation is one of the most widely used compute patterns in high-performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, implementing these optimizations correctly while accounting for the complexity of GPU architectures and memory hierarchies, and still achieving high performance, is difficult. We propose AN5D, an automated stencil fr…
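To make temporal blocking concrete, here is a minimal Python sketch. It is not AN5D's generated CUDA code. It assumes a 1D three-point averaging stencil with fixed boundary values, and the names `naive`, `temporally_blocked`, `tile` and `degree` are hypothetical. Each tile copies its cells plus a halo of `degree` cells per side into a local buffer, which stands in for on-chip memory. It then advances `degree` time steps there and writes back only its interior. The result is that the main array is read and written once per `degree` steps instead of once per step.

```python
def naive(u, steps):
    """Reference: apply a 3-point averaging stencil `steps` times.
    The two boundary cells (indices 0 and n-1) are held fixed."""
    u = list(u)
    n = len(u)
    for _ in range(steps):
        nxt = u[:]
        for i in range(1, n - 1):
            nxt[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
        u = nxt
    return u


def temporally_blocked(u, steps, tile, degree):
    """Same result as naive(), computed with overlapped temporal blocking.

    Each pass fuses up to `degree` time steps: every tile copies its cells
    plus a halo of t cells per side into a local buffer, advances t steps
    there, and writes back only its own (still valid) interior cells."""
    u = list(u)
    n = len(u)
    done = 0
    while done < steps:
        t = min(degree, steps - done)  # time steps fused in this pass
        out = u[:]
        for start in range(0, n, tile):
            end = min(start + tile, n)
            lo, hi = max(0, start - t), min(n, end + t)
            buf = u[lo:hi]  # tile plus halo: the "on-chip" working set
            for _ in range(t):
                nxt = buf[:]
                # Buffer edges are never updated; stale halo values creep
                # inward one cell per step but never reach the interior.
                for j in range(1, len(buf) - 1):
                    nxt[j] = (buf[j - 1] + buf[j] + buf[j + 1]) / 3.0
                buf = nxt
            out[start:end] = buf[start - lo:end - lo]
        u = out
        done += t
    return u
```

The cost of this approach is redundant computation in the overlapping halos, and that cost grows with `degree`. This is one reason the blocking degree is usually tuned per stencil and per device rather than fixed.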

Cited by 45 publications (35 citation statements); references 37 publications.
“…Rawat et al [51] introduced a domain-specific language called "STENCILGEN" to describe and generate optimized GPU code for stencil computations, leveraging multiple tiling techniques. AN5D [42] is another framework for automatic generation of optimized stencil GPU code, from generic C code. It also depends on different forms of temporal blocking.…”
Section: Related Work (mentioning; confidence: 99%)
“…For example, the authors of [4] use a model to optimize the computation/register ratio, which is important for the class of stencils they are targeting. In [5], a standard roofline model with a fixed, theoretical memory volume is used for a full exploration of the configuration space, followed by benchmarking the top five candidates.…”
Section: Related Work (mentioning; confidence: 99%)
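The selection procedure in the excerpt above (a roofline model with a fixed, theoretical memory volume used to rank every configuration, then benchmarking the top five) can be sketched as follows. This is an illustrative Python sketch, not the cited work's model. The `Config` fields, the memory-volume formula for a 1D overlapped tile, the on-chip capacity check and the machine numbers are all hypothetical assumptions. Each candidate's time per cell update is predicted as the larger of its compute time and its memory time, and the five best predictions are returned for real benchmarking.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    degree: int  # time steps fused per pass (temporal blocking degree)
    tile: int    # interior cells per tile (1D, for illustration)


def predicted_time(cfg, flops_per_point, peak_flops, bandwidth, word_bytes=8):
    """Roofline estimate in seconds per useful cell update.

    Fixed theoretical memory volume: each pass reads tile + 2*degree words,
    writes tile words, and advances `degree` time steps."""
    b, w = cfg.degree, cfg.tile
    redundancy = (w + 2 * b) / w  # upper bound on recomputed halo work
    flops = flops_per_point * redundancy
    bytes_moved = word_bytes * (2 * w + 2 * b) / (w * b)
    return max(flops / peak_flops, bytes_moved / bandwidth)


def top_candidates(configs, k, onchip_words, **machine):
    """Drop configs whose tile+halo exceeds on-chip capacity, rank the rest
    by predicted time, and return the k best for real benchmarking."""
    feasible = [c for c in configs if c.tile + 2 * c.degree <= onchip_words]
    return sorted(feasible, key=lambda c: predicted_time(c, **machine))[:k]


# Hypothetical machine parameters, not the specs of any particular GPU.
machine = dict(flops_per_point=4, peak_flops=7e12, bandwidth=9e11)
configs = [Config(b, w) for b, w in product(range(1, 9), (32, 64, 128, 256))]
best = top_candidates(configs, k=5, onchip_words=256, **machine)
for c in best:
    print(c, f"{predicted_time(c, **machine):.3e} s/update")
```

The model only needs to order the candidates roughly, because the final choice among the shortlisted five comes from measured runs.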
“…There is a rich literature describing efforts to efficiently implement stencil computations on CPUs [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [15], [16], [12] and GPUs [13], [14], [17], [18], [19], [22], [23]. We discuss the most related efforts below.…”
Section: Related Work (mentioning; confidence: 99%)
“…Recently, as part of their AN5D framework work, Matsumura et al [19] apply three more refinements to 2.5D and 3.5D solutions: fixed register allocations, double buffering, and division of the streaming dimension. While these approaches work extremely well for simple single-statement kernels, neither boundary conditions nor multi-statement stencils are evaluated.…”
Section: Related Work (mentioning; confidence: 99%)
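The streaming idea behind 2.5D blocking can be shown in two dimensions for brevity. The Python sketch below is a hypothetical illustration, not AN5D's implementation. It applies one step of a 5-point Jacobi stencil by streaming along y with a three-row rolling window, which stands in for a per-thread register queue. It also divides the streaming dimension into independent chunks of `chunk` output rows. The function names and the `chunk` parameter are illustrative. The other refinements mentioned, fixed register allocation and double buffering, are not modeled.

```python
from collections import deque


def jacobi5_naive(grid):
    """One step of a 2D 5-point Jacobi stencil; boundary cells are copied."""
    ny, nx = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            out[y][x] = 0.2 * (grid[y][x] + grid[y - 1][x] + grid[y + 1][x]
                               + grid[y][x - 1] + grid[y][x + 1])
    return out


def jacobi5_streamed(grid, chunk):
    """Same step, streaming along y with a 3-row rolling window.

    The streaming dimension is split into chunks of `chunk` output rows;
    each chunk preloads one halo row and could run on a separate block."""
    ny, nx = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y0 in range(1, ny - 1, chunk):
        y1 = min(y0 + chunk, ny - 1)  # this chunk writes rows [y0, y1)
        window = deque([grid[y0 - 1], grid[y0]], maxlen=3)
        for y in range(y0, y1):
            window.append(grid[y + 1])  # stream in next row; oldest drops out
            up, mid, down = window
            for x in range(1, nx - 1):
                out[y][x] = 0.2 * (mid[x] + up[x] + down[x]
                                   + mid[x - 1] + mid[x + 1])
    return out
```

Each chunk re-reads one halo row on each side, so smaller chunks expose more parallelism at the cost of extra loads. This is the basic trade-off when dividing the streaming dimension.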