2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
DOI: 10.1109/ipdps.2010.5470415

Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures

Abstract: This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems. We consider single- and double-precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our numerous findings, we show that optimization and parallelization can improve double-precision performance by 25× on Intel's …
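To make the abstract's optimization targets concrete, here is a minimal sketch in C of the direct (P2P) near-field kernel that typically dominates FMM run time, with the kind of OpenMP parallelization the abstract mentions. The `Box` layout and the function names are illustrative assumptions for this sketch, not the paper's actual implementation.

```c
/* Minimal sketch of the FMM direct (P2P) near-field kernel with OpenMP
 * parallelization over target boxes. Data layout and names are illustrative
 * assumptions, not the paper's implementation. */
#include <math.h>
#include <omp.h>

typedef struct {
    double *x, *y, *z;   /* particle coordinates (structure-of-arrays) */
    double *q;           /* source strengths */
    double *phi;         /* accumulated potentials */
    int n;               /* number of particles in this box */
} Box;

/* Direct pairwise interaction of one target box with one source box. */
static void p2p(Box *tgt, const Box *src)
{
    for (int i = 0; i < tgt->n; i++) {
        double acc = 0.0;
        for (int j = 0; j < src->n; j++) {
            double dx = tgt->x[i] - src->x[j];
            double dy = tgt->y[i] - src->y[j];
            double dz = tgt->z[i] - src->z[j];
            double r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > 0.0)                 /* skip self-interaction */
                acc += src->q[j] / sqrt(r2);
        }
        tgt->phi[i] += acc;
    }
}

/* Evaluate the near field: each leaf box interacts with its neighbor list.
 * Parallelizing over target boxes keeps all writes private to one thread,
 * so no synchronization is needed on phi. */
void near_field(Box *boxes, int nboxes, int **nbr, int *nnbr)
{
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nboxes; b++)
        for (int k = 0; k < nnbr[b]; k++)
            p2p(&boxes[b], &boxes[nbr[b][k]]);
}
```

Parallelizing over target boxes (rather than source boxes) is one common way to avoid write conflicts; low-level tuning such as SIMD vectorization and blocking would layer on top of this structure.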

Cited by 44 publications (53 citation statements)
References 19 publications
“…Originally envisioned to facilitate the optimization of matrix-matrix multiplication, it has since been applied to a number of other computational kernels, including sparse matrix-vector multiplication and the fast Fourier transform [10,27,30]. Over the last decade, auto-tuning has expanded from simple loop transformations (loop blocking, unroll and jam) to include exploration of alternate data structures, optimizations for efficient shared-memory parallelism (threading, data replication, data synchronization), and exploration of algorithmic parameters (particles per box in FMM, steps in communication-avoiding Krylov subspace methods) [4,15,18,27,35]. Additionally, auto-tuners have specialized to maximize either performance or generality.…”
Section: Related Work
confidence: 99%
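The algorithmic-parameter exploration this statement describes can be pictured as a simple timing sweep. Below is a hedged sketch in C: `build_tree` and `run_fmm` are hypothetical stand-ins for a real FMM implementation, and only the tuning loop itself is the point.

```c
/* Sketch of auto-tuning an algorithmic parameter: sweep the FMM's
 * particles-per-box threshold and keep the fastest setting.
 * build_tree() and run_fmm() are hypothetical stand-ins. */
#include <float.h>
#include <stdio.h>
#include <omp.h>

extern void *build_tree(int particles_per_box);  /* hypothetical */
extern void run_fmm(void *tree);                 /* hypothetical */

int tune_particles_per_box(void)
{
    static const int candidates[] = { 32, 64, 128, 256, 512 };
    int best = candidates[0];
    double best_time = DBL_MAX;

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        void *tree = build_tree(candidates[i]);
        double t0 = omp_get_wtime();
        run_fmm(tree);              /* time one representative evaluation */
        double dt = omp_get_wtime() - t0;
        if (dt < best_time) {
            best_time = dt;
            best = candidates[i];
        }
    }
    printf("best particles/box: %d (%.3f s)\n", best, best_time);
    return best;
}
```

A production auto-tuner would average several runs, prune the search space, and possibly co-tune this parameter with the expansion order, but the structure of the search is the same.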
“…We believe that the achieved total run times are among the best ever reported for the FMM for the sizes of the problems considered (e.g., comparing with [10,11,13,14,27]). Fig.…”
Section: Multiple Heterogeneous Nodes
confidence: 70%
“…The FMM was considered on a cluster of GPUs in [11,10], and the benefits of architecture tuning on networks of multicore processors or GPUs were considered in [12,13,14]. In these papers, adaptations of previous FMM algorithms were used, and impressive performance was achieved.…”
Section: Fast Multipole Methods and Scalability
confidence: 99%
“…Efficient partitioning is necessary to reduce remote communication [30], [18]. However, most efforts in implementing parallel applications find that application insight and awareness of the architecture are necessary to optimize the parallel implementation [26], [6], [11]. Parallel Discrete Event Simulation is difficult to parallelize because of its fine-grained nature and its complex, dynamic dependency pattern [12], making it substantially different from typical parallel applications.…”
Section: Related Work
confidence: 99%