2017
DOI: 10.1007/978-3-319-64203-1_40
Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture

Cited by 7 publications (7 citation statements)
References 13 publications
“…This effectively means that the Intel compiler rather generates optimal vector code for the kernel. Hence, writing an intrinsics code for a routine, where the compiler successfully manages to vectorize it, is very often unnecessary [51]. …”
Section: Vectorization Efficiency of the Edge-based Loop
confidence: 99%
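The auto-vectorization point in the quote can be illustrated with a minimal sketch. The kernel below is hypothetical (not from the cited paper): a plain axpy-style loop that icc, gcc, and clang typically vectorize on their own at -O2/-O3, so hand-written intrinsics would add little.

```c
#include <stddef.h>

/* Hypothetical axpy-style kernel for illustration. With `restrict` ruling out
 * aliasing between x and y, mainstream compilers auto-vectorize this loop,
 * which is exactly the situation the quoted passage describes: the compiler
 * already emits near-optimal vector code, so intrinsics are unnecessary. */
void axpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Compiling with a vectorization report (e.g., `-qopt-report` on the Intel compiler, `-fopt-info-vec` on gcc) confirms whether the loop was vectorized before resorting to intrinsics.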
“…In addition, AoS enhances the locality of references for interacting particles after they are sorted and indexed based on their Morton order. Cells maintain both indexes and

struct particle_t { num SRC; num COORD[3]; } __attribute__((aligned(64)));
struct cell_t { particle_t *b_ptr; size_t b_count; } __attribute__((aligned(64)));
particle_t *particles;
cell_t *cells;
…”
Section: Data-level Parallelism
confidence: 99%
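The structs quoted above can be restored to a compilable sketch. The `num` typedef and the `cell_bind` helper below are assumptions added for illustration (the original defines `num` elsewhere); the point is that after Morton sorting, each cell needs only a pointer into the contiguous particle array plus a count.

```c
#include <stddef.h>

typedef double num;  /* assumption: `num` is a floating-point scalar type */

/* Array-of-Structs particle layout, 64-byte aligned as in the quoted code,
 * so each particle sits within a single cache line boundary. */
typedef struct { num SRC; num COORD[3]; } __attribute__((aligned(64))) particle_t;

/* A cell stores only a pointer and a count: particles belonging to a cell
 * stay contiguous in memory once the array is sorted by Morton key. */
typedef struct { particle_t *b_ptr; size_t b_count; } __attribute__((aligned(64))) cell_t;

/* Hypothetical helper (not from the paper): bind a cell to the run
 * [first, first + count) of the Morton-sorted particle array. */
void cell_bind(cell_t *c, particle_t *first, size_t count) {
    c->b_ptr = first;
    c->b_count = count;
}
```

With this layout, traversing a cell's particles is a linear scan over consecutive memory, which is what makes the AoS arrangement friendly to both caches and vector loads.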
“…In this paper, we present an extreme scale, rapidly converging implementation of an FMM-accelerated linear solver for wave scattering for the complex 3D Helmholtz Boundary Integral Equation (BIE). FMM is a very compute intensive algorithm [5] that is portable and adaptable to different levels of parallelism [3], and exhibits scalable communication [4,39]. It is thus natural to rely upon such an algorithm to accelerate the matrix-vector multiplication kernel to scale the application performance to a large number of tightly-coupled compute nodes.…”
confidence: 99%
“…Nevertheless, the ever-expanding gap between the developing demands for massive computations and the languishing transistor budgets triggered by the "retirement" of Moore's Law has inevitably deteriorated the possible performance gains out of the architectural advancements in the hardware design. Therefore, fine-grained parallelism (Abduljabbar et al, 2018) required at the node-level is becoming pervasive, especially since the performance of a compute node that powers the current and future supercomputers is highly dependent upon the performance provided by a tightly coupled specialized hardware for accelerator-driven computing (e.g., GPUs) connected directly to the compute node via a high-bandwidth, high-speed interconnect (e.g., NVIDIA NVLink) (Abduljabbar et al, 2017).…”
Section: Introduction
confidence: 99%