Formal Modeling and Performance Evaluation of a Run-Time Rank Remapping Technique in Broadcast, Allgather and Allreduce MPI Collective Operations

Alvarez-Llorente, J.M.; Díaz‐Martín, Juan C.; Rico‐Gallego, Juan‐Antonio

doi:10.1109/ccgrid.2017.32

Cited by 4 publications

(7 citation statements)

References 14 publications


“…In addition, this collective has a hierarchical design adapted to the target platform. The reduction is solved following a specific rank ordering in this hierarchical design [63] generating dependencies between intermediate calculations. This fact combined with several tasks sharing one CPU (oversubscription) may degrade performance proportional to the size of the application.…”

Section: Evaluation Results (mentioning)

confidence: 99%

Task Packing: Efficient task scheduling in unbalanced parallel programs to maximize CPU utilization

Utrera

Farreras

Fornes

2019

Journal of Parallel and Distributed Computing


Load imbalance in parallel systems can be generated by external factors to the currently running applications like operating system noise or the underlying hardware like a heterogeneous cluster. HPC applications working on irregular data structures can also have difficulties to balance their computations across the parallel tasks. In this article we extend, improve and evaluate more deeply the Task Packing mechanism proposed in a previous work. The main idea of the mechanism is to concentrate the idle cycles of unbalanced applications in such a way that one or more CPUs are freed from execution. To achieve this, CPUs are stressed with just useful work of the parallel application tasks, provided performance is not degraded. The packing is solved by an algorithm based on the Knapsack problem, in a minimum number of CPUs and using oversubscription. We design and implement a more efficient version of such mechanism. To that end, we perform the Task Packing "in place", taking advantage of idle cycles generated at synchronization points of unbalanced applications. Evaluations are carried out on a heterogeneous platform using FT and miniFE benchmarks. Results showed that our proposal generates low overhead. In addition the amount of freed CPUs are related to a load imbalance metric which can be used as a prediction for it.

“…On every step, a process with rank r will send a block to the process with rank r+1 and receive another from the process with rank r−1 (wrapping around if a destination or source is out of bounds). Each process's own block is sent on the first step, while on all others the block received on the previous step is forwarded [7]. Hereafter, the number of processes involved in the algorithm is represented by p, while m represents the total amount of data that a process must have at the end of the operation.…”

Section: A. Allgather Algorithms (mentioning)

confidence: 99%
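
The ring exchange described in the statement above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: it does not use MPI, the function name `ring_allgather` is hypothetical, and each "process" is modeled as a list index rather than a real rank. It shows that after p − 1 steps every rank holds all p blocks.

```python
def ring_allgather(blocks):
    """Simulate a Ring allgather; blocks[r] is the block owned by rank r.

    Returns a list where entry r is the buffer rank r holds at the end.
    (Hypothetical helper for illustration; not an MPI call.)
    """
    p = len(blocks)
    # result[r][i] holds the block contributed by rank i, as seen by rank r.
    result = [[None] * p for _ in range(p)]
    for r in range(p):
        result[r][r] = blocks[r]

    # outgoing[r] is the index of the block rank r sends on the current step.
    # On the first step every rank sends its own block.
    outgoing = list(range(p))

    for _ in range(p - 1):
        incoming = [None] * p
        for r in range(p):
            dst = (r + 1) % p          # wrap around at the end of the ring
            incoming[dst] = outgoing[r]
        for r in range(p):
            idx = incoming[r]          # block index received from rank r - 1
            result[r][idx] = blocks[idx]
        # On later steps each rank forwards what it received on the previous one.
        outgoing = incoming

    return result


if __name__ == "__main__":
    for rank, buf in enumerate(ring_allgather(["a", "b", "c", "d"])):
        print(rank, buf)
```

In a real MPI program the inner loops would correspond to one combined send/receive per rank per step; the simulation only tracks which block moves where.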

“…The formal definitions of the algorithms assume equally balanced communication costs to all peers, but computing clusters and supercomputers often employ hierarchical network topologies [6]. On these networks the cost for performing communication between two nodes is highly dependent on the physical location of each peer [7], and the further away they are, the longer are the physical paths between them and therefore the higher the latency. From a bandwidth perspective, the further away two nodes are the higher is the chance that their communication will cross the core of the network, whose bandwidth is more expensive and supports less saturation than the edge [17], possibly leading to slowdowns or contentions.…”

Section: B. Problem Formulation (mentioning)

confidence: 99%

“…The delay magnitude for the completion of a collective is a product of several factors, which include the underlying hardware topology [5], communication protocols, network capacity [3], placement of processes [6], [7] and others. However, one of paramount importance is the performance of the algorithm employed to coordinate the high level inter process communication and block transferences [8], [9].…”

Section: Introduction (mentioning)

confidence: 99%

“…For Allgather, the generally available algorithms are Ring, Neighbor Exchange, Bruck and Recursive Doubling. The Ring algorithm has a linear growth of both latency and bandwidth time with the increase in the number of processes [7]. Neighbor Exchange has the same asymptotic behaviour but with a less steep increase in time, requiring only half of the Ring's steps, with the downside of only working for even numbers of processes [11].…”

Section: Introduction (mentioning)

confidence: 99%
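
The comparison between Ring and Neighbor Exchange in the statement above can be made concrete with a simple cost sketch. It assumes the common latency-bandwidth model, which the quoted text does not spell out: α (per-message latency) and β (per-byte transfer time) are symbols introduced here for illustration, while p (number of processes) and m (total data each process holds at the end) follow the definitions quoted earlier on this page.

```latex
% Ring: p - 1 steps, each moving one block of size m/p
T_{\text{ring}} = (p-1)\,\alpha + \frac{p-1}{p}\, m\, \beta

% Neighbor Exchange (even p): p/2 steps; the first moves m/p,
% each of the remaining p/2 - 1 steps moves 2m/p
T_{\text{neighbor}} = \frac{p}{2}\,\alpha
  + \left[\frac{m}{p} + \left(\frac{p}{2}-1\right)\frac{2m}{p}\right]\beta
  = \frac{p}{2}\,\alpha + \frac{p-1}{p}\, m\, \beta
```

Under these assumptions both algorithms move the same volume of data, and Neighbor Exchange roughly halves only the latency term, which matches the "half of the Ring's steps" remark.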


Sparbit: a new logarithmic-cost and data locality-aware MPI Allgather algorithm

Loch

Koslovski

2021

Preprint


The collective operations are considered critical for improving the performance of exascale-ready and high-performance computing applications. On this paper we focus on the Message-Passing Interface (MPI) Allgather many to many collective, which is amongst the most called and time-consuming operations. Each MPI algorithm for this call suffers from different operational and performance limitations, that might include only working for restricted cases, requiring linear amounts of communication steps with the growth in number of processes, memory copies and shifts to assure correct data organization, and non-local data exchange patterns, most of which negatively contribute to the total operation time. All these characteristics create an environment where there is no algorithm which is the best for all cases and this consequently implies that careful choices of alternatives must be made to execute the call. Considering such aspects, we propose the Stripe Parallel Binomial Trees (Sparbit) algorithm, which has optimal latency and bandwidth time costs with no usage restrictions. It also maintains a much more local communication pattern that minimizes the delays due to long range exchanges, allowing the extraction of more performance from current systems when compared with asymptotically equivalent alternatives. On its best scenario, Sparbit surpassed the traditional MPI algorithms on 46.43% of test cases with mean (median) improvements of 34.7% (26.16%) and highest reaching 84.16%.


Sparbit: Towards to a Logarithmic-Cost and Data Locality-Aware MPI Allgather Algorithm

Loch

Koslovski

2023

J Grid Computing

