Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

Datta, Kaushik; Kamil, Shoaib; Williams, Samuel; Oliker, Leonid; Shalf, John; Yelick, Katherine

doi:10.1137/070693199

Cited by 180 publications (159 citation statements)

References 18 publications (41 reference statements)

Citation statement types: Supporting, Mentioning (155), Contrasting, Unclassified

“…Variable or constant weighted contributions of these neighbours represent the discretized coefficients of the given PDE for a particular data point. Typically stencil based codes achieve poor performance due to low arithmetic intensity [13], [17]. This suggests that arithmetic intensity and cache optimization ought not be neglected when selecting a suitable decomposition.…”

Section: Or the Finite Element Methods (Fem), mentioning

confidence: 99%
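To make the arithmetic-intensity remark in the statement above concrete, here is a minimal sketch that computes flops per byte for a 7-point, constant-coefficient, double-precision stencil. The flop and byte counts are illustrative assumptions (8 flops per point; 16 bytes per point with ideal cache reuse, 24 bytes when a write-allocate fill is counted), and the function name is hypothetical; none of it is taken from the quoted paper.

```python
def arithmetic_intensity(flops_per_point, bytes_per_point):
    """Return flops performed per byte of memory traffic."""
    if bytes_per_point <= 0:
        raise ValueError("bytes_per_point must be positive")
    return flops_per_point / bytes_per_point


# Assumed 7-point stencil: 6 additions + 2 multiplications per grid point.
FLOPS_PER_POINT = 8
# Assumed ideal traffic: one 8-byte read plus one 8-byte write per point.
IDEAL_BYTES = 16
# Assumed traffic when the written cache line is also read first (write-allocate).
WRITE_ALLOCATE_BYTES = 24

ideal = arithmetic_intensity(FLOPS_PER_POINT, IDEAL_BYTES)
with_fill = arithmetic_intensity(FLOPS_PER_POINT, WRITE_ALLOCATE_BYTES)
print(f"ideal: {ideal:.2f} flops/byte, with write-allocate: {with_fill:.2f} flops/byte")
```

Under these assumptions the kernel performs well under one flop per byte moved, which is why such codes tend to be limited by memory bandwidth rather than by compute throughput.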

“…Parallel efficiency is inherently connected to an optimized serial code and there have been numerous efforts to optimize the re-use of data in the cache memory [10], [11], [12], [13], [16]. Cache blocking/tiling optimizations for maximum cache reuse have focussed both on using appropriate block sizes of data to improve spatial locality as well as enhancing data locality between adjacent time steps or iterations [9], [10], [11], [12], [13], [16].…”

Section: Or the Finite Element Methods (Fem), mentioning

confidence: 99%
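The cache blocking idea in the statement above can be sketched as follows. This is a minimal illustration on a 2-D 5-point averaging stencil, not the method of any cited paper: the function names and the block sizes `bi` and `bj` are hypothetical, and pure Python will not show a real speedup. It only demonstrates that visiting the grid tile by tile produces the same result as a plain row-by-row sweep.

```python
def sweep_naive(src, dst):
    """One 5-point stencil sweep over the interior, row by row."""
    n, m = len(src), len(src[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j]
                                + src[i][j - 1] + src[i][j + 1])


def sweep_blocked(src, dst, bi, bj):
    """Same sweep, but the interior is visited in bi-by-bj tiles."""
    n, m = len(src), len(src[0])
    for ii in range(1, n - 1, bi):           # tile origin, rows
        for jj in range(1, m - 1, bj):       # tile origin, columns
            for i in range(ii, min(ii + bi, n - 1)):
                for j in range(jj, min(jj + bj, m - 1)):
                    dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j]
                                        + src[i][j - 1] + src[i][j + 1])


# Small illustrative grid; the values are arbitrary.
grid = [[float(i * j + i) for j in range(9)] for i in range(7)]
out_naive = [row[:] for row in grid]
out_blocked = [row[:] for row in grid]
sweep_naive(grid, out_naive)
sweep_blocked(grid, out_blocked, bi=2, bj=4)
print(out_naive == out_blocked)
```

In a compiled language the tile sizes would be chosen so that the rows a tile touches stay resident in cache while they are reused; here they are arbitrary.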

“…Performance optimization can start with domain decomposition at the macro-level. Figure 4 illustrates that traditional optimizations only consider reducing the cache misses [9] after performing domain decomposition [10], [11], [12], [13], [14]. We take a reverse approach in the sense that we derive a domain decomposition based on optimization of cache-misses.…”

Section: Or the Finite Element Methods (Fem), mentioning

confidence: 99%

A cache-aware approach to domain decomposition for stencil-based codes

Saxena; Jimack; Walkley

2016 International Conference on High Performance Computing & Simulation (HPCS), 2016

Abstract. Partial Differential Equations (PDEs) lie at the heart of numerous scientific simulations depicting physical phenomena. The parallelization of such simulations introduces additional performance penalties in the form of local and global synchronization among cooperating processes. Domain decomposition partitions the largest shareable data structures into sub-domains and attempts to achieve perfect load balance and minimal communication. Up to now, research efforts to optimize spatial and temporal cache reuse for stencil-based PDE discretizations (e.g. finite difference and finite element) have considered sub-domain operations after the domain decomposition has been determined. We derive a cache-oblivious heuristic that minimizes cache misses at the sub-domain level through a quasi-cache-directed analysis to predict families of high performance domain decompositions in structured 3-D grids. To the best of our knowledge this is the first work to optimize domain decompositions by analyzing cache misses, thus connecting single core parameters (i.e. cache-misses) to true multicore parameters (i.e. domain decomposition). We analyze the trade-offs in decreasing cache-misses through such decompositions and increasing the dynamic bandwidth-per-core. The limitation of our work is that currently, it is applicable only to structured 3-D grids with cuts parallel to the Cartesian Axes. We emphasize and conclude that there is an imperative need to re-think domain decompositions in this constantly evolving multicore era.

“…A number of works have addressed optimizations of stencil computations on emerging multicore platforms [7], [16], [17], [6], [27], [26], [11], [37], [10], [4], [9], [40], [38], [41], [8], [39]. In addition, other transformations such as tiling of stencil computations for multicore architectures have been addressed in [43], [25], [21], [34].…”

Section: Related Work, mentioning

confidence: 99%

Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures

Henretty; Stock; Pouchet; et al.

Lecture Notes in Computer Science, 2011

Abstract. Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However a fundamental memory stream alignment issue limits achieved performance with stencil computations on modern short SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on several modern processors with SIMD capabilities.

“…Short "stanzas" of memory access see substantially degraded performance. Stanza Triad was created to quantify this effect [9]. Unfortunately, it is not threaded and as such cannot identify when one has transitioned from a concurrency-limited regime to a throughput-limited regime when running on multicore processors.…”

Section: Related Work, mentioning

confidence: 99%
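The access pattern behind the "stanza" remark above can be sketched roughly as follows. This is an illustrative assumption about the pattern, not the Stanza Triad benchmark itself: a triad update `a[i] = b[i] + scalar * c[i]` is applied to short contiguous runs (stanzas) of `stanza_len` elements, and the start of each run is `stride` elements after the previous one. The function and parameter names are hypothetical.

```python
def stanza_triad(a, b, c, scalar, stanza_len, stride):
    """Apply a triad update to contiguous stanzas separated by a fixed stride.

    stride is the distance between stanza starts, so elements in the gap
    between two stanzas are left untouched.
    """
    if stanza_len <= 0 or stride < stanza_len:
        raise ValueError("need 0 < stanza_len <= stride")
    n = len(a)
    start = 0
    while start < n:
        end = min(start + stanza_len, n)
        for i in range(start, end):      # unit-stride access inside a stanza
            a[i] = b[i] + scalar * c[i]
        start += stride                  # jump to the next stanza
    return a


# Illustrative run: stanzas of 2 elements starting every 5 elements.
a = [0.0] * 10
b = [1.0] * 10
c = [2.0] * 10
result = stanza_triad(a, b, c, scalar=3.0, stanza_len=2, stride=5)
print(result)
```

Timing such a loop in a compiled language for decreasing `stanza_len` is one way to observe how short contiguous runs affect achieved memory bandwidth; the Python version only shows which elements are touched.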