A plug-and-play model for evaluating wavefront computations on parallel architectures

Mudalige, Gihan R.; Vernon, Mary K.; Jarvis, Stephen A.

doi:10.1109/ipdps.2008.4536243

Cited by 28 publications

(42 citation statements)

References 9 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…One particular example is that several application models assume that broadcast or allreduce scale with Θ(S log(P )) (e.g., [3,17]) while, as demonstrated in Section 4, a good MPI implementation would implement a broadcast or allreduce with Θ(S + log(P )) [5,13,21]. Generally speaking, performance models for middleware libraries such as MPI depend on the parameters of the network (e.g., bandwidth, latency, topology, routing) and the implemented algorithms (e.g., collective algorithms, eager and rendezvous protocols) and are thus hard to generalize.…”

Section: Motivationmentioning

confidence: 99%

Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues

Hoefler

Gropp

Thakur

et al. 2010

Lecture Notes in Computer Science

View full text Add to dashboard Cite

Abstract. Designing and tuning parallel applications with MPI, particularly at large scale, requires understanding the performance implications of different choices of algorithms and implementation options. Which algorithm is better depends in part on the performance of the different possible communication approaches, which in turn can depend on both the system hardware and the MPI implementation. In the absence of detailed performance models for different MPI implementations, application developers often must select methods and tune codes without the means to realistically estimate the achievable performance and rationally defend their choices. In this paper, we advocate the construction of more useful performance models that take into account limitations on network-injection rates and effective bisection bandwidth. Since collective communication plays a crucial role in enabling scalability, we also provide analytical models for scalability of collective communication algorithms, such as broadcast, allreduce, and all-to-all. We apply these models to an IBM Blue Gene/P system and compare the analytical performance estimates with experimentally measured values. MotivationPerformance modeling of parallel applications leads to an understanding of their running time on parallel systems. To develop a model for an existing application or algorithm, one typically constructs a dependency graph of computations and communications from the start of the algorithm (input) to the end of the algorithm (output). This application model can then be matched to a machine model in order to estimate the run time of the algorithm on a particular architecture.Performance models can be used to make important early decisions about algorithmic choices. For example, to compute a three-dimensional Fast Fourier Transformation (3d FFT), one can either use a one-dimensional decomposition where each process computes full planes (2d FFTs) or a two-dimensional decomposition where each process computes sets of pencils (1d FFTs). If we assume

show abstract

Section: Motivationmentioning

confidence: 99%

Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues

Hoefler

Gropp

Thakur

et al. 2010

Lecture Notes in Computer Science

View full text Add to dashboard Cite

show abstract

“…An iteration of Hydra employs several parallel functions, each of which consists of a number of the above operations. An aim of this work is to capture the time to solution by modelling the critical-path run-time of the code as demonstrated in previous analytic modelling research (Mudalige et al, 2008). We develop a general analytic model for the first two key operations (local computation, near-neighbour communication) before applying these to specific segments of the Hydra code.…”

Section: A Predictive Model For Hydramentioning

confidence: 99%

“…From the number of inter-and intra-node connections (equations 11 and 12), we can derive a model for near-neighbour communications in a similar fashion to that found in Mudalige et al (2008). This makes the assumption that the communication network is full-duplex and that the time for two nodes to perform a non-blocking send and receive is equivalent to the time for a single blocking send and receive because of Hydra's use of MPI_Waitall.…”

Section: Near Neighbour Point-to-pointmentioning

confidence: 99%

“…The development of such HPC codes, the evaluation of their performance on candidate systems and, sustaining performant execution, is a costly and time consuming exercise. To aid in these activities research has been conducted into developing accurate performance modelling tools and techniques for application analysis (Sundaram-Stukel and Vernon, 1999;Hoisie et al, 2000;Kerbyson et al, 2001;Mathis and Kerbyson, 2004;Mudalige et al, 2008).…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Predictive analysis of a hydrodynamics application on large-scale CMP clusters

et al. 2011

View full text Add to dashboard Cite

We present the development of a predictive performance model for the high-performance computing code Hydra, a hydrodynamics benchmark developed and maintained by the United Kingdom Atomic Weapons Establishment (AWE). The developed model elucidates the parallel computation of Hydra, with which it is possible to predict its run-time and scaling performance on varying large-scale chip multiprocessor (CMP) clusters. A key feature of the model is its granularity; with the model we are able to separate the contributing costs, including computation, point-topoint communications, collectives, message buffering and message synchronisation. The predictions are validated on two contrasting large-scale HPC systems, an AMD Opteron/ InfiniBand cluster and an IBM BlueGene/P, both of which are located at the Lawrence Livermore National Laboratory (LLNL) in the US. We validate the model on up to 2,048 cores, where it achieves a > 85% accuracy in weak-scaling studies. We also demonstrate use of the model in exposing the increasing costs of collectives for this application, and also the influence of node density on network accesses, therefore highlighting the impact of machine choice when running this hydrodynamics application at scale.

show abstract

“…Many HPC centres are therefore turning to alternative tools and methodologies (e.g. predictive performance modelling [1], [2], hardware simulation [3], [4] and mini-applications [5], [6]) to facilitate system evaluation, to aid in the comparison of multiple candidate machines, to investigate optimisation strategies, and to act as a vehicle for porting codes to novel architectures.…”

mentioning

confidence: 99%

Predictive Evaluation of Partitioning Algorithms through Runtime Modelling

Bunt

Wright

Jarvis

et al. 2016

2016 IEEE 23rd International Conference on High Performance Computing (HiPC)

View full text Add to dashboard Cite

Abstract-Performance modelling unstructured mesh codes is a challenging process, due to the difficulty of capturing their memory access patterns, and their communication patterns at varying scale. In this paper we first develop extensions to an existing runtime performance model, aimed at overcoming the former, which we validate on up to 1,024 cores of a Haswellbased cluster, using both a geometric partitioning algorithm and ParMETIS to partition the input deck, with a maximum absolute runtime error of 12.63% and 11.55% respectively. To overcome the latter, we develop an application representative of the mesh partitioning process internal to an unstructured mesh code. This application is able to generate partitioning data that is usable with the performance model to produce predicted application runtimes within 7.31% of those produced using empirically collected data. We then demonstrate the use of the performance model by undertaking a predictive comparison among several partitioning algorithms on up to 30,000 cores. Additionally, we correctly predict the ineffectiveness of the geometric partitioning algorithm at 512 and 1024 cores.

show abstract

A plug-and-play model for evaluating wavefront computations on parallel architectures

Cited by 28 publications

References 9 publications

Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues

Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues

Predictive analysis of a hydrodynamics application on large-scale CMP clusters

Predictive Evaluation of Partitioning Algorithms through Runtime Modelling

Contact Info

Product

Resources

About