An introductory exascale feasibility study for FFTs and multigrid

Gahvari, Hormozd; Gropp, William

doi:10.1109/ipdps.2010.5470417

Cited by 21 publications

(21 citation statements)

References 6 publications


“…For example, we only consider the pencil decomposition of the transpose method for computing a 3D FFT. Other distributed FFT algorithms exist, but for realistic problem sizes on current and future large-scale systems the pencil decomposition is the best option [17,22]. We also assume ideal problem sizes: n is a power of two and all dimensions are equally sized.…”

Section: Limitations (mentioning)

confidence: 99%

“…The second study is Gahvari's and Gropp's theoretical analysis of feasible latency and bandwidth regimes at exascale, using LogGP modeling and pencil/transpose-based FFTs as one benchmark [9,22]. Their model is more general than ours in that it is agnostic about specific architectural forms at exascale; however, ours may be more prescriptive about the necessary changes by explicitly modeling particular architectural features.…”

Section: Related Work (mentioning)

confidence: 99%
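
To make the LogGP modeling mentioned in the statement above concrete, here is a minimal sketch of the standard LogGP point-to-point cost, in which a k-byte message takes o + (k - 1)G + L + o. The function names and the all-to-all estimate are illustrative assumptions, not the model used in either paper; the all-to-all helper simply assumes p - 1 exchanges performed one after another with no overlap.

```python
def loggp_message_time(k_bytes, L, o, G):
    """Time for one k-byte point-to-point message under LogGP.

    L: network latency, o: per-message CPU overhead (paid by sender and
    receiver), G: gap per byte for long messages. All in the same time unit.
    """
    if k_bytes < 1:
        raise ValueError("message must contain at least one byte")
    return o + (k_bytes - 1) * G + L + o


def naive_alltoall_time(p, k_bytes, L, o, G):
    """Hypothetical upper-bound estimate for an all-to-all among p ranks.

    Assumption for illustration only: each rank performs p - 1 exchanges
    sequentially, with no pipelining or overlap between messages.
    """
    if p < 1:
        raise ValueError("need at least one rank")
    return (p - 1) * loggp_message_time(k_bytes, L, o, G)


if __name__ == "__main__":
    # Made-up parameter values, chosen only to exercise the functions.
    print(loggp_message_time(5, L=2.0, o=0.5, G=0.25))
    print(naive_alltoall_time(4, 5, L=2.0, o=0.5, G=0.25))
```

Under this sketch, latency and overhead dominate for small k and the per-byte gap G dominates for large k, which is the kind of latency/bandwidth trade-off the statement refers to.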

“…To obtain this estimate, we extrapolated the various system parameters of Table 1, used them to determine the form (e.g., number of nodes) required to get a system running at 4 EF/s, selected a problem size according to the methodology of Gahvari and Gropp [22], and then evaluated our performance model to estimate execution time. Prediction 2 below further elaborates on this calculation.…”

Section: Predictions (mentioning)

confidence: 99%


On the communication complexity of 3D FFTs and its implications for Exascale

Czechowski

Battaglino

McClanahan

et al. 2012

Proceedings of the 26th ACM International Conference on Supercomputing


This paper revisits the communication complexity of large-scale 3D fast Fourier transforms (FFTs) and asks what impact trends in current architectures will have on FFT performance at exascale. We analyze both memory hierarchy traffic and network communication to derive suitable analytical models, which we calibrate against current software implementations; we then evaluate models to make predictions about potential scaling outcomes at exascale, based on extrapolating current technology trends. Of particular interest is the performance impact of choosing high-density processors, typified today by graphics co-processors (GPUs), as the base processor for an exascale system. Among various observations, a key prediction is that although inter-node all-to-all communication is expected to be the bottleneck of distributed FFTs, intra-node communication, expressed precisely in terms of the relative balance among compute capacity, memory bandwidth, and network bandwidth, will play a critical role.


“…PME calculation, due to the data transposes required in three dimensional Fourier transforms, is highly communication intensive [22], and therefore very challenging to scale. While NAMD supports both slab (one dimensional decomposition) and pencil (two dimensional decomposition) PME, this paper addresses only the pencil form due to its superior scaling characteristics [6].…”

Section: NAMD (mentioning)

confidence: 99%
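
The scaling difference between the slab and pencil forms named in the statement above can be sketched as follows. For an n x n x n grid, a slab (1D) decomposition can use at most n ranks, while a pencil (2D) decomposition can use up to n * n. The function names and the process-grid parameters `p_rows` and `p_cols` are hypothetical and chosen for illustration; this is not NAMD's implementation.

```python
def max_ranks(n, decomposition):
    """Upper bound on ranks that can each own at least one slab or pencil."""
    if decomposition == "slab":      # 1D: split along one axis only
        return n
    if decomposition == "pencil":    # 2D: split along two axes
        return n * n
    raise ValueError("decomposition must be 'slab' or 'pencil'")


def pencil_local_shape(n, p_rows, p_cols):
    """Local block owned by one rank on a p_rows x p_cols process grid.

    Assumes n is divisible by both process-grid dimensions; the first
    axis stays whole so 1D FFTs along it need no communication.
    """
    if n % p_rows or n % p_cols:
        raise ValueError("n must be divisible by p_rows and p_cols")
    return (n, n // p_rows, n // p_cols)


if __name__ == "__main__":
    n = 1024
    print(max_ranks(n, "slab"), max_ranks(n, "pencil"))
    print(pencil_local_shape(n, 32, 64))
```

The price of the extra parallelism is that the pencil form needs transposes within both the rows and the columns of the process grid, which is why it is communication intensive.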

“…Constructing the grid and extracting the result from it is shown at left and the 3-D FFT forward and backward at right. Pencil based distributed parallel implementations of 3-D FFT have communication requirements that are well studied in the literature [6], so we present a minimal summary of the critical issues for completeness. Furthermore, the communication process from reciprocal space to real space is the reverse of the real to reciprocal process, therefore only the forward path will be considered in detail.…”

Section: NAMD (mentioning)

confidence: 99%

Optimizing fine-grained communication in a biomolecular simulation application on Cray XK6

Sun

Zheng

Mei

et al. 2012

2012 International Conference for High Performance Computing, Networking, Storage and Analysis


Achieving good scaling for fine-grained, communication-intensive applications on modern supercomputers remains challenging. In our previous work, we have shown that such an application, NAMD, scales well on the full Jaguar XT5 without long-range interactions; yet, with them, the speedup falters beyond 64K cores. Although the new Gemini interconnect on Cray XK6 has improved network performance, the challenges remain, and are likely to remain for other such networks as well. We analyze communication bottlenecks in NAMD and its CHARM++ runtime, using the Projections performance analysis tool. Based on the analysis, we optimize the runtime, built on the uGNI library for Gemini. We present several techniques to improve the fine-grained communication. Consequently, the performance of running 92224-atom Apoa1 with GPUs on TitanDev is improved by 36%. For 100-million-atom STMV, we improve upon the prior Jaguar XT5 result of 26 ms/step to 13 ms/step using 298,992 cores on Jaguar XK6.

Optimization of multigrid based elliptic solver for large scale simulations in the FLASH code

Daley

Vanella

Dubey

et al. 2012

Concurrency and Computation


FLASH is a multiphysics, multiscale adaptive mesh refinement (AMR) code originally designed for simulation of reactive flows often found in astrophysics. With its wide user base and flexible application configuration capability, FLASH has a dual task of maintaining scalability and portability in all its solvers. The scalability of fully explicit solvers in the code is tied very closely to that of the underlying mesh. Others, such as the Poisson solver based on a multigrid method, have more complex scaling behavior. Multigrid methods suffer from processor starvation and dominating communication costs at coarser grids as the number of processors increases. In this paper, we propose a combination of uniform grid mesh with AMR mesh, and the merger of two different sets of solvers to overcome the scalability limitation of the Poisson solver in FLASH. The principal challenge in the proposed merger is the efficiency of the communication algorithm to map the mesh back and forth between uniform grid and AMR. We present two different parallel mapping algorithms and also discuss results from performance studies of the two implementations.

… (AMR) package in FLASH. In AMR meshes, the limit to scalability comes from refinement, essentially a global process. Additionally, AMR uses space-filling curves such as Morton ordering to achieve a reasonable balance between proximity and load balancing at the cost of a more complex neighborhood distribution for individual blocks. This can have an impact on scaling, especially when block-count per processor is low. The scaling behavior also depends on the frequency of refinement events and the underlying interconnect architecture of the target high performance computing (HPC) platform. Machines such as Blue Gene/P, which have a fast global interconnect, show essentially similar weak scaling behavior for the explicit solvers irrespective of the frequency of refinement. On Cray-XT machines, with their known scaling limits for global operations, the solvers scale well when refinement is infrequent, but begin to show scaling limitations at larger processor counts when refinement is frequent [6]. In addition to the explicit solvers, FLASH also supports a multigrid Poisson solver, which is the focus of this paper. The multigrid solver in FLASH is used for two main purposes: for computing self-gravity in cosmological simulations [7,8], and for computing pressure in the incompressible Navier-Stokes solvers used in fluid-structure interaction applications described in [9,10]. The multigrid implementation in FLASH was imported from the solver described in [11], which in turn is adapted from [12]. The solver is currently limited to Cartesian coordinates in a rectangular boxlike domain. However, for the two classes of applications described earlier, this is not a limitation. Cosmology simulations, especially in three dimensions (3D), are typically performed on a rectangular box. In fluid-structure interaction problems, because of th...
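
As an illustration of the Morton ordering mentioned in the excerpt above, here is a minimal sketch that interleaves the bits of three integer block coordinates into a single key. The function name, the bit order (x lowest), and the 10-bit default are assumptions for illustration; the excerpt does not specify how FLASH computes its keys.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton key.

    Bit i of x lands at position 3*i, of y at 3*i + 1, of z at 3*i + 2.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


if __name__ == "__main__":
    # Sort the eight blocks of a 2 x 2 x 2 grid along the curve.
    blocks = [(x, y, z) for z in range(2) for y in range(2) for x in range(2)]
    for b in sorted(blocks, key=lambda c: morton3d(*c)):
        print(b, morton3d(*b))
```

Cutting the sorted key sequence into equal contiguous chunks gives each processor blocks that are mostly close together in space, which is the proximity versus load-balance trade-off the excerpt describes.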

