2018
DOI: 10.1007/978-3-319-78024-5_10

Parallel Assembly of ACA BEM Matrices on Xeon Phi Clusters

Cited by 3 publications (4 citation statements)
References 14 publications
“…This GPU would also achieve much higher performance in the H-matrix setup, since our model BEM code would run much faster than on the rather old Tesla K20X cards. In the future, we also aim at combining domain-decomposition parallelization with task-based parallelization, as for example in [3,31], to solve even larger problem sizes.…”
Section: Performance and Scalability (mentioning)
confidence: 99%
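
The combination of coarse-grain domain decomposition with fine-grain task-based parallelism that the citing authors mention can be illustrated with a minimal OpenMP tasking sketch over ACA block assembly. This is a generic illustration under assumed types, not the scheme of [3,31] or of the reviewed paper: Block, assemble_aca_block, and assemble_dense_block are hypothetical placeholders (compile with -fopenmp).

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct Block {
    int row_cluster;   // index of the test cluster
    int col_cluster;   // index of the trial cluster
    bool admissible;   // admissible pairs are compressed by ACA
};

// Hypothetical assembly kernels, standing in for the real quadrature code.
static void assemble_dense_block(const Block&) { /* near-field: full quadrature */ }
static void assemble_aca_block(const Block&)   { /* far-field: adaptive cross approximation */ }

void assemble_blocks(const std::vector<Block>& blocks) {
    #pragma omp parallel
    #pragma omp single
    {
        // One task per block; the runtime balances the uneven block
        // costs across the threads of this (per-subdomain) process.
        for (std::size_t i = 0; i < blocks.size(); ++i) {
            #pragma omp task firstprivate(i) shared(blocks)
            {
                const Block& b = blocks[i];
                if (b.admissible) assemble_aca_block(b);
                else              assemble_dense_block(b);
            }
        }
    }   // tasks are guaranteed to finish before the parallel region ends
}

int main() {
    std::vector<Block> blocks = { {0, 0, false}, {0, 1, true}, {1, 0, true} };
    assemble_blocks(blocks);
    std::printf("assembled %zu blocks\n", blocks.size());
    return 0;
}
```

Tasking suits ACA assembly because block costs are not known in advance (the rank discovered by ACA varies per block), so letting the runtime schedule one task per block balances load better than a static split; the domain-decomposition layer (e.g. one MPI rank per subdomain) is deliberately omitted here.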
“…In the future, we aim at improving the multi-GPU load balancing by techniques proposed, e.g., in [3,31]. However, while these techniques work well in the context of non-batched operations, we assume that their combination with batching will still be sub-optimal on GPUs.…”
Section: Performance and Scalability (mentioning)
confidence: 99%
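
For intuition on the load-balancing problem raised above, the sketch below applies the classic longest-processing-time (LPT) heuristic to per-block cost estimates, assigning each block to the currently least-loaded GPU. This is a toy under assumptions, not the technique of [3,31]: BlockJob and the cost model are invented for the example, and the batching constraints the authors point out are not modeled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct BlockJob {
    int id;        // index of the matrix block
    double cost;   // estimated work, e.g. rows*cols (dense) or rank*(rows+cols) (ACA)
};

std::vector<int> assign_to_gpus(std::vector<BlockJob> jobs, int num_gpus) {
    // LPT: sort jobs by decreasing cost, then always hand the next job
    // to whichever GPU currently carries the least accumulated cost.
    std::sort(jobs.begin(), jobs.end(),
              [](const BlockJob& a, const BlockJob& b) { return a.cost > b.cost; });

    using Load = std::pair<double, int>;   // (accumulated cost, gpu id)
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> least_loaded;
    for (int g = 0; g < num_gpus; ++g) least_loaded.push({0.0, g});

    std::vector<int> gpu_of_block(jobs.size(), 0);
    for (const BlockJob& j : jobs) {
        auto [load, gpu] = least_loaded.top();
        least_loaded.pop();
        gpu_of_block[j.id] = gpu;
        least_loaded.push({load + j.cost, gpu});
    }
    return gpu_of_block;
}

int main() {
    std::vector<BlockJob> jobs = { {0, 9.0}, {1, 4.0}, {2, 7.0}, {3, 2.0} };
    const std::vector<int> owner = assign_to_gpus(jobs, 2);
    for (std::size_t i = 0; i < owner.size(); ++i)
        std::printf("block %zu -> GPU %d\n", i, owner[i]);
    return 0;
}
```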
“…In Sect. 3 we propose a strategy to parallelize the assembly of the MTF matrix blocks and their application in an iterative solver, based on the approach presented in [11-13] for single-domain problems. Apart from the distributed parallelism, the method takes full advantage of the BEM4I library [14,20,21] and its assemblers, which are parallelized in shared memory and vectorized via OpenMP.…”
Section: Introduction (mentioning)
confidence: 99%
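
The shared-memory parallelization and OpenMP vectorization attributed to the BEM4I assemblers can be sketched as follows: threads share the (i, j) entry loop while SIMD handles the innermost quadrature loop. This is a toy collocation-style assembler for the Laplace single-layer kernel, not BEM4I code; the function name, the flat array layout, and the kernel are assumptions for illustration (compile with -fopenmp).

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Assemble A(i,j) = sum_q wq[j*Q+q] / (4*pi*|x_i - y_{j,q}|):
// a toy collocation discretization of the Laplace single-layer kernel.
void assemble_dense(std::vector<double>& A, int m, int n,
                    const std::vector<double>& x,    // 3*m collocation points
                    const std::vector<double>& yq,   // 3*n*Q quadrature points
                    const std::vector<double>& wq,   // n*Q quadrature weights
                    int Q) {
    const double pi = 3.14159265358979323846;
    const double c = 1.0 / (4.0 * pi);
    // Threads split the entry loops; dynamic scheduling absorbs cost variation.
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double aij = 0.0;
            // Innermost quadrature loop is a reduction the compiler can
            // vectorize; omp simd makes the intent explicit.
            #pragma omp simd reduction(+ : aij)
            for (int q = 0; q < Q; ++q) {
                const int k = 3 * (j * Q + q);
                const double dx = x[3 * i]     - yq[k];
                const double dy = x[3 * i + 1] - yq[k + 1];
                const double dz = x[3 * i + 2] - yq[k + 2];
                aij += wq[j * Q + q] / std::sqrt(dx * dx + dy * dy + dz * dz);
            }
            A[i * n + j] = c * aij;
        }
    }
}

int main() {
    const int m = 4, n = 4, Q = 2;
    std::vector<double> A(m * n);
    std::vector<double> x(3 * m, 0.5);       // all collocation points at (0.5, 0.5, 0.5)
    std::vector<double> yq(3 * n * Q, 1.5);  // all quadrature points at (1.5, 1.5, 1.5)
    std::vector<double> wq(n * Q, 0.25);
    assemble_dense(A, m, n, x, yq, wq, Q);
    std::printf("A(0,0) = %g\n", A[0]);
    return 0;
}
```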