A cache-aware approach to domain decomposition for stencil-based codes

Concurrency and Computation

Jimack

Walkley

2017

Summary: Stencil computations form the heart of numerical simulations to solve Partial Differential Equations using Finite Difference, Finite Element, and Finite Volume methods. Geometric Multigrid is an optimal, O(N), hierarchical tool employing stencil computations in its chief constituents, namely, smoothing, restriction, and interpolation. When Multigrid is parallelized over distributed‐shared memory architectures, traditionally, the domain partitioning creates cubic partitions of the mesh to minimize overall communication. Thus, the orthodox approach considers only load‐balancing and communication minimization for completely determining the domain partitioning. In this article, we show that these two factors are not sufficient to obtain optimal partitions for Parallel Geometric Multigrid. To this effect, we develop and validate a high level analytical model to show that “close to 2‐D” partitions for Geometric Multigrid can give higher performance than the partitions returned by the MPI_Dims_create() function which minimizes the communication volume by default. We quantify sub‐domain level cache‐misses in Parallel Geometric Multigrid and obtain families of optimal domain partitions. We conclude that the sub‐domain level cache‐misses for the application‐specific stencil computational kernel and communicated planes should be taken into account in addition to communication minimization/load‐balance to obtain optimal partitions for Parallel Geometric Multigrid.
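To make the contrast between cubic and “close to 2‐D” partitions concrete, the following minimal sketch enumerates the process grids for a given process count and counts the halo cells exchanged per interior sub-domain. It is an illustration only, not the authors' analytical model: the function names, the one-cell-deep halo, and the assumption that the mesh size divides evenly by each grid dimension are all hypothetical choices made for this example.

```python
def factorizations(p):
    """Return all ordered triples (px, py, pz) with px * py * pz == p."""
    triples = []
    for px in range(1, p + 1):
        if p % px:
            continue
        q = p // px
        for py in range(1, q + 1):
            if q % py == 0:
                triples.append((px, py, q // py))
    return triples


def halo_cells(n, dims):
    """Halo cells exchanged by one interior sub-domain of an n^3 mesh.

    Assumptions (illustrative only): a one-cell-deep halo, n divisible by
    every entry of dims, and two neighbours along each split dimension.
    A dimension that is not split (size 1) exchanges nothing.
    """
    px, py, pz = dims
    sx, sy, sz = n // px, n // py, n // pz
    total = 0
    if px > 1:
        total += 2 * sy * sz  # two faces normal to x
    if py > 1:
        total += 2 * sx * sz  # two faces normal to y
    if pz > 1:
        total += 2 * sx * sy  # two faces normal to z
    return total


if __name__ == "__main__":
    n, p = 512, 64
    # Print every candidate process grid, smallest communication volume first.
    for dims in sorted(factorizations(p), key=lambda d: halo_cells(n, d)):
        print(dims, halo_cells(n, dims))
```

Under these assumptions, for a 512^3 mesh on 64 processes the balanced 4 x 4 x 4 grid exchanges the fewest cells (98304 per sub-domain), while a 1 x 8 x 8 grid exchanges 131072. Communication volume alone therefore favours the cubic partition. The abstract's point is that this count ignores cache behaviour, which can favour the partition with the larger halo.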

A quasi‐cache‐aware model for optimal domain partitioning in parallel geometric multigrid

Concurrency and Computation

Jimack

Walkley

2017

BioFVM-X: An MPI+OpenMP 3-D Simulator for Biological Systems

Computational Methods in Systems Biology

León

Montagud

et al. 2021

Multi-scale simulations require parallelization to address large-scale problems, such as real-sized tumor simulations. BioFVM is a software package that solves diffusive transport Partial Differential Equations for 3-D biological simulations successfully applied to tissue and cancer biology problems. Currently, BioFVM is only shared-memory parallelized using OpenMP, greatly limiting the execution of large-scale jobs in HPC clusters. We present BioFVM-X: an enhanced version of BioFVM capable of running on multiple nodes. BioFVM-X uses MPI+OpenMP to parallelize the generic core kernels of BioFVM and shows promising scalability in large 3-D problems with several hundred diffusible substrates and ≈ 0.5 billion voxels. The BioFVM-X source code, examples, and documentation are available under the BSD 3-Clause license at https://gitlab.bsc.es/gsaxena/biofvm_x.

A Cache-Aware Approach to Adaptive Mesh Refinement in Parallel Stencil-Based Solvers

2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference On

Jimack

Walkley

2017

In prior research the authors have demonstrated that, for stencil-based numerical solvers for Partial Differential Equations (PDEs), the parallel performance can be significantly improved by selecting sub-domains that are not cubic in shape (Saxena et al., HPCS 2016, pp. 875-885). This is achieved through accounting for cache utilization in both the message passing and the computational kernel, where it is demonstrated that the optimal domain decompositions depend not only on the communication and load balance but also on the cache misses, amongst other factors. In this work we demonstrate that those conclusions may also be extended to more advanced numerical discretizations, based upon Adaptive Mesh Refinement (AMR). In particular, we show that when basing our AMR strategy on the local refinement of patches of the mesh, the optimal patch shape is not typically cubic. We provide specific examples, with accompanying explanation, to show that communication-minimizing strategies are not necessarily the best choice when applying AMR in parallel. All numerical tests undertaken in this work are based upon the open source BoxLib library.
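One way to see why cache utilization in the message passing can depend on sub-domain shape is to count the cache lines touched when a single boundary plane is read for packing. The sketch below is a simplified illustration, not the model used in the cited work. It assumes a row-major (C-order) array starting at a line-aligned address, 8-byte values, and 64-byte cache lines; the function name and all parameters are hypothetical.

```python
def lines_touched(shape, axis, index=0, itemsize=8, line_bytes=64):
    """Count distinct cache lines read when gathering one plane of a 3-D array.

    The array is assumed row-major with its base address aligned to a cache
    line, so the last dimension is the unit-stride one. `axis` is the
    dimension held fixed at `index`, that is, the normal of the plane.
    """
    nx, ny, nz = shape
    # Byte strides of a C-order array: the last index varies fastest.
    strides = (ny * nz * itemsize, nz * itemsize, itemsize)
    ranges = [range(nx), range(ny), range(nz)]
    ranges[axis] = [index]
    lines = set()
    for i in ranges[0]:
        for j in ranges[1]:
            for k in ranges[2]:
                addr = i * strides[0] + j * strides[1] + k * strides[2]
                lines.add(addr // line_bytes)
    return len(lines)


if __name__ == "__main__":
    shape = (64, 64, 64)
    for axis, name in enumerate("xyz"):
        print(f"plane normal to {name}: {lines_touched(shape, axis)} lines")
```

For a 64 x 64 x 64 sub-domain under these assumptions, planes normal to the two slower dimensions each touch 512 cache lines, whereas a plane normal to the unit-stride dimension touches 4096, one line per value. Each plane holds the same 4096 values, so a pure communication-volume count treats all three as equal even though the memory traffic needed to pack them differs by a factor of eight.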