2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
DOI: 10.1109/ipdps.2010.5470415

Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures

Abstract: This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems. We consider single- and double-precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our numerous findings, we show that optimization and parallelization can improve double-precision performance by 25× on Intel's …
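To make the abstract's optimization targets concrete, here is a minimal sketch in C of the direct (P2P) near-field kernel that typically dominates FMM run time, with the kind of OpenMP parallelization the abstract mentions. The `Box` layout and the function names are illustrative assumptions for this sketch, not the paper's actual implementation.

```c
/* Minimal sketch of the FMM direct (P2P) near-field kernel with OpenMP
 * parallelization over target boxes. Data layout and names are illustrative
 * assumptions, not the paper's implementation. */
#include <math.h>
#include <omp.h>

typedef struct {
    double *x, *y, *z;   /* particle coordinates (structure-of-arrays) */
    double *q;           /* source strengths */
    double *phi;         /* accumulated potentials */
    int n;               /* number of particles in this box */
} Box;

/* Direct pairwise interaction of one target box with one source box. */
static void p2p(Box *tgt, const Box *src)
{
    for (int i = 0; i < tgt->n; i++) {
        double acc = 0.0;
        for (int j = 0; j < src->n; j++) {
            double dx = tgt->x[i] - src->x[j];
            double dy = tgt->y[i] - src->y[j];
            double dz = tgt->z[i] - src->z[j];
            double r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > 0.0)                 /* skip self-interaction */
                acc += src->q[j] / sqrt(r2);
        }
        tgt->phi[i] += acc;
    }
}

/* Evaluate the near field: each leaf box interacts with its neighbor list.
 * Parallelizing over target boxes keeps all writes private to one thread,
 * so no synchronization is needed on phi. */
void near_field(Box *boxes, int nboxes, int **nbr, int *nnbr)
{
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nboxes; b++)
        for (int k = 0; k < nnbr[b]; k++)
            p2p(&boxes[b], &boxes[nbr[b][k]]);
}
```

Parallelizing over target boxes (rather than source boxes) is one common way to avoid write conflicts; low-level tuning such as SIMD vectorization and blocking would layer on top of this structure.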

Cited by 44 publications (53 citation statements)
References 19 publications
“…Originally envisioned to facilitate the optimization of matrix-matrix multiplication, it has since been applied to a number of other computational kernels, including sparse matrix-vector multiplication and the fast Fourier transform [10,27,30]. Over the last decade, auto-tuning has expanded from simple loop transformations (loop blocking, unroll and jam) to include exploration of alternate data structures, optimizations for efficient shared-memory parallelism (threading, data replication, data synchronization), and exploration of algorithmic parameters (particles per box in FMM, steps in communication-avoiding Krylov subspace methods) [4,15,18,27,35]. Additionally, auto-tuners have specialized to maximize either performance or generality.…”
Section: Related Work
confidence: 99%
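The algorithmic-parameter exploration this statement describes can be pictured as a simple timing sweep. Below is a hedged sketch in C: `build_tree` and `run_fmm` are hypothetical stand-ins for a real FMM implementation, and only the tuning loop itself is the point.

```c
/* Sketch of auto-tuning an algorithmic parameter: sweep the FMM's
 * particles-per-box threshold and keep the fastest setting.
 * build_tree() and run_fmm() are hypothetical stand-ins. */
#include <float.h>
#include <stdio.h>
#include <omp.h>

extern void *build_tree(int particles_per_box);  /* hypothetical */
extern void run_fmm(void *tree);                 /* hypothetical */

int tune_particles_per_box(void)
{
    static const int candidates[] = { 32, 64, 128, 256, 512 };
    int best = candidates[0];
    double best_time = DBL_MAX;

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        void *tree = build_tree(candidates[i]);
        double t0 = omp_get_wtime();
        run_fmm(tree);              /* time one representative evaluation */
        double dt = omp_get_wtime() - t0;
        if (dt < best_time) {
            best_time = dt;
            best = candidates[i];
        }
    }
    printf("best particles/box: %d (%.3f s)\n", best, best_time);
    return best;
}
```

A production auto-tuner would average several runs, prune the search space, and possibly co-tune this parameter with the expansion order, but the structure of the search is the same.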
“…We believe that the achieved total run times are among the best ever reported for the FMM for the sizes of the problems considered (e.g., comparing with [10,11,13,14,27]). Fig.…”
Section: Multiple Heterogeneous Nodes
confidence: 70%
“…The FMM was considered on a cluster of GPUs in [11,10], and the benefits of architecture tuning on networks of multicore processors or GPUs were considered in [12,13,14]. In these papers, adaptations of previous FMM algorithms were used, and impressive performance was achieved.…”
Section: Fast Multipole Methods and Scalability
confidence: 99%
“…Efficient partitioning is necessary to reduce remote communication [30], [18]. However, most efforts in implementing parallel applications find that application insight and awareness of the architecture are necessary to optimize the parallel implementation [26], [6], [11]. Parallel Discrete Event Simulation is difficult to parallelize because of its fine-grained nature and its complex, dynamic dependency pattern [12], making it substantially different from typical parallel applications.…”
Section: Related Work
confidence: 99%