2009
DOI: 10.1137/070693199

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

Abstract: Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. In this paper, we explore the impact of trends in memory subsystems on a variety of stencil optimization techniques and develop performance models to analytically guide our optimizations. Our work targets cache reuse methodologies across single and multiple st…

Cited by 180 publications (159 citation statements)
References 18 publications (41 reference statements)

Citation statements
“…Variable or constant weighted contributions of these neighbours represent the discretized coefficients of the given PDE for a particular data point. Typically stencil based codes achieve poor performance due to low arithmetic intensity [13], [17]. This suggests that arithmetic intensity and cache optimization ought not be neglected when selecting a suitable decomposition.…”
Section: Or the Finite Element Methods (Fem); citation type: mentioning
confidence: 99%
“…Parallel efficiency is inherently connected to an optimized serial code and there have been numerous efforts to optimize the re-use of data in the cache memory [10], [11], [12], [13], [16]. Cache blocking/tiling optimizations for maximum cache reuse have focussed both on using appropriate block sizes of data to improve spatial locality as well as enhancing data locality between adjacent time steps or iterations [9], [10], [11], [12], [13], [16].…”
Section: Or the Finite Element Methods (Fem); citation type: mentioning
confidence: 99%
“…A number of works have addressed optimizations of stencil computations on emerging multicore platforms [7], [16], [17], [6], [27], [26], [11], [37], [10], [4], [9], [40], [38], [41], [8], [39]. In addition, other transformations such as tiling of stencil computations for multicore architectures have been addressed in [43], [25], [21], [34].…”
Section: Related Work; citation type: mentioning
confidence: 99%
“…Short "stanzas" of memory access see substantially degraded performance. Stanza Triad was created to quantify this effect [9]. Unfortunately, it is not threaded and as such cannot identify when one has transitioned from a concurrency-limited regime to a throughput-limited regime when running on multicore processors.…”
Section: Related Work; citation type: mentioning
confidence: 99%