Locality aware concurrent start for stencil applications

Shrestha, Sunil; Gao, Guang R.; Manzano, Joseph; Márquez, Andrés; Feo, John

doi:10.1109/cgo.2015.7054196

Cited by 9 publications

(4 citation statements)

References 17 publications


“…When a thread takes a task, atomics are used to avoid race conditions. This work was extended in [Shrestha et al 2015]. They combined their introduced jagged tiling approach with the diamond tiling extension of the PLUTO framework to allow concurrent start at the inter- and intra-tile levels.…”

Section: Related Work Utilizing Cache Block Sharing (mentioning)

confidence: 99%


Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations

Malas, Hager, Ltaief, et al. 2017. ACM Trans. Parallel Comput.

Optimizing the performance of stencil algorithms has been the subject of intense research over the last two decades. Since many stencil schemes have low arithmetic intensity, most optimizations focus on increasing the temporal data access locality, thus reducing the data traffic through the main memory interface with the ultimate goal of decoupling from this bottleneck. There are, however, only few approaches that explicitly leverage the shared cache feature of modern multicore chips. If every thread works on its private, separate cache block, the available cache space can become too small, and sufficient temporal locality may not be achieved. We propose a flexible multi-dimensional intra-tile parallelization method for stencil algorithms on multicore CPUs with a shared outer-level cache. This method leads to a significant reduction in the required cache space without adverse effects from hardware prefetching or TLB shortage. Our Girih framework includes an auto-tuner to select optimal parameter configurations on the target hardware. We conduct performance experiments on two contemporary Intel processors and compare with the state-of-the-art stencil frameworks PLUTO and Pochoir, using four corner-case stencil schemes and a wide range of problem sizes. Girih shows substantial performance advantages and best arithmetic intensity at almost all problem sizes, especially on low-intensity stencils with variable coefficients. We study in detail the performance behavior at varying grid size using phenomenological performance modeling. Our analysis of energy consumption reveals that our method can save energy by reduced DRAM bandwidth usage even at marginal performance gain. It is thus well suited for future architectures that will be strongly challenged by the cost of data movement, be it in terms of performance or energy consumption.

ACM Reference Format: Tareq M. Malas, Georg Hager, Hatem Ltaief, and David E. Keyes, 2015. Multi-dimensional intra-tile parallelization for memory-starved stencil computations. ACM Trans. Parallel Comput. 0, 0, Article 0 (0), 44 pages.


“…In contrast, our MWD approach allows the thread group to share one large diamond tile, providing more in-cache data reuse. Figure 5 of their paper [Shrestha et al 2015] shows an example of their two-level tiling. The diamond tile is split into nine sub-tile updates for fine-grained parallelization.…”

Section: Related Work Utilizing Cache Block Sharing (mentioning)

confidence: 99%

Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations

Malas, Hager, Ltaief, et al. 2017. ACM Trans. Parallel Comput.

“…Bandishti et al [4] and Bondhugula et al [6] proposed a general formalism for diamond tiling in the polyhedral model by introducing a rescheduling step in the Pluto compiler. There has been a great amount of work [11,13,22,25,33,34] reported on the evaluation of diamond tiling. It was also generalized to handle iterated stencils defined over periodic data domains with index set splitting [5] and the Lattice-Boltzmann method [26].…”

Section: Related Work (mentioning)

confidence: 99%

Flextended Tiles

Zhao, Cohen. 2019. ACM Trans. Archit. Code Optim.

Loop tiling to exploit data locality and parallelism plays an essential role in a variety of general-purpose and domain-specific compilers. Affine transformations in polyhedral frameworks implement classical forms of rectangular and parallelogram tiling, but these lead to pipelined start with rather inefficient wavefront parallelism. Multiple extensions to polyhedral compilers evaluated sophisticated shapes such as trapezoid or diamond tiles, enabling concurrent start along the axes of the iteration space; yet these resort to custom schedulers and code generators insufficiently integrated within the general framework. One of these modified shapes referred to as overlapped tiling also lacks a unifying framework to reason about its composition with affine transformations; this prevents its application in general-purpose loop-nest optimizers and the fair comparison with other techniques. We revisit overlapped tiling, recasting it as an affine transformation on schedule trees composable with any affine scheduling algorithm. We demonstrate how to derive tighter tile shapes with less redundant computations. Our method models the traditional "scalene trapezoid" shapes and novel "right-rectangle" variants. It goes beyond the state of the art by avoiding the restriction to a domain-specific language or introducing post-pass rescheduling and custom code generation. We conduct experiments on the PolyMage benchmarks and iterated stencils, validating the effectiveness and applicability of our technique on both general-purpose multicores and GPU accelerators. CCS Concepts: • Software and its engineering → Compilers;


“…On the other hand, cache block sharing technologies (introduced by Wellein et al [21]), achieve better performance by utilizing the shared hardware caches of modern CPUs. Recently, Shrestha et al [30] introduced cache block sharing techniques within PLUTO framework to perform source-to-source transformation of the stencil codes. To the extent of our knowledge, all proposed cache block sharing temporal blocking techniques compromise tile size for intra-tile concurrency, which we show to be sub-optimal in this work.…”

Section: Related Work (mentioning)

confidence: 99%

Optimization of an Electromagnetics Code with Multicore Wavefront Diamond Blocking and Multi-dimensional Intra-Tile Parallelization

Malas, Hornich, Hager, et al. 2016. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

Understanding and optimizing the properties of solar cells is becoming a key issue in the search for alternatives to nuclear and fossil energy sources. A theoretical analysis via numerical simulations involves solving Maxwell's Equations in discretized form and typically requires substantial computing effort. We start from a hybrid-parallel (MPI+OpenMP) production code that implements the Time Harmonic Inverse Iteration Method (THIIM) with Finite-Difference Frequency Domain (FDFD) discretization. Although this algorithm has the characteristics of a strongly bandwidth-bound stencil update scheme, it is significantly different from the popular stencil types that have been exhaustively studied in the high performance computing literature to date. We apply a recently developed stencil optimization technique, multicore wavefront diamond tiling with multi-dimensional cache block sharing, and describe in detail the peculiarities that need to be considered due to the special stencil structure. Concurrency in updating the components of the electric and magnetic fields provides an additional level of parallelism. The dependence of the cache size requirement of the optimized code on the blocking parameters is modeled accurately, and an auto-tuner searches for optimal configurations in the remaining parameter space. We were able to completely decouple the execution from the memory bandwidth bottleneck, accelerating the implementation by a factor of three to four compared to an optimal implementation with pure spatial blocking on an 18-core Intel Haswell CPU.
