2017
DOI: 10.1007/978-3-319-64203-1_40
Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture

Cited by 7 publications (7 citation statements)
References 13 publications
“…This effectively means that the Intel compiler rather generates optimal vector code for the kernel. Hence, writing an intrinsics code for a routine, where the compiler successfully manages to vectorize it, is very often unnecessary [51]. …”
Section: Vectorization Efficiency of the Edge-based Loop
confidence: 99%
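The auto-vectorization point in the quote can be illustrated with a minimal sketch. The kernel below is hypothetical (not from the cited paper): a plain axpy-style loop that icc, gcc, and clang typically vectorize on their own at -O2/-O3, so hand-written intrinsics would add little.

```c
#include <stddef.h>

/* Hypothetical axpy-style kernel for illustration. With `restrict` ruling out
 * aliasing between x and y, mainstream compilers auto-vectorize this loop,
 * which is exactly the situation the quoted passage describes: the compiler
 * already emits near-optimal vector code, so intrinsics are unnecessary. */
void axpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Compiling with a vectorization report (e.g., `-qopt-report` on the Intel compiler, `-fopt-info-vec` on gcc) confirms whether the loop was vectorized before resorting to intrinsics.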
“…In addition, AoS enhances the locality of references for interacting particles after they are sorted and indexed based on their Morton order. Cells maintain both indexes and

struct particle_t { num SRC; num COORD[3]; } __attribute__((aligned(64)));
struct cell_t { particle_t *b_ptr; size_t b_count; } __attribute__((aligned(64)));
particle_t *particles;
cell_t *cells;
…”
Section: Data-level Parallelism
confidence: 99%
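The structs quoted above can be restored to a compilable sketch. The `num` typedef and the `cell_bind` helper below are assumptions added for illustration (the original defines `num` elsewhere); the point is that after Morton sorting, each cell needs only a pointer into the contiguous particle array plus a count.

```c
#include <stddef.h>

typedef double num;  /* assumption: `num` is a floating-point scalar type */

/* Array-of-Structs particle layout, 64-byte aligned as in the quoted code,
 * so each particle sits within a single cache line boundary. */
typedef struct { num SRC; num COORD[3]; } __attribute__((aligned(64))) particle_t;

/* A cell stores only a pointer and a count: particles belonging to a cell
 * stay contiguous in memory once the array is sorted by Morton key. */
typedef struct { particle_t *b_ptr; size_t b_count; } __attribute__((aligned(64))) cell_t;

/* Hypothetical helper (not from the paper): bind a cell to the run
 * [first, first + count) of the Morton-sorted particle array. */
void cell_bind(cell_t *c, particle_t *first, size_t count) {
    c->b_ptr = first;
    c->b_count = count;
}
```

With this layout, traversing a cell's particles is a linear scan over consecutive memory, which is what makes the AoS arrangement friendly to both caches and vector loads.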
“…In this paper, we present an extreme scale, rapidly converging implementation of an FMM-accelerated linear solver for wave scattering for the complex 3D Helmholtz Boundary Integral Equation (BIE). FMM is a very compute intensive algorithm [5] that is portable and adaptable to different levels of parallelism [3], and exhibits scalable communication [4,39]. It is thus natural to rely upon such an algorithm to accelerate the matrix-vector multiplication kernel to scale the application performance to a large number of tightly-coupled compute nodes.…”
confidence: 99%
“…Nevertheless, the ever-expanding gap between the developing demands for massive computations and the languishing transistor budgets triggered by the "retirement" of Moore's Law has inevitably deteriorated the possible performance gains out of the architectural advancements in the hardware design. Therefore, fine-grained parallelism (Abduljabbar et al, 2018) required at the node-level is becoming pervasive, especially since the performance of a compute node that powers the current and future supercomputers is highly dependent upon the performance provided by a tightly coupled specialized hardware for accelerator-driven computing (e.g., GPUs) connected directly to the compute node via a high-bandwidth, high-speed interconnect (e.g., NVIDIA NVLink) (Abduljabbar et al, 2017).…”
Section: Introduction
confidence: 99%