2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
DOI: 10.1109/ipdpsw.2016.139
Topology-Aware Rank Reordering for MPI Collectives

Cited by 19 publications (5 citation statements); references 17 publications.
“…Hierarchical algorithms, as explored in [11]-[13], the multi-leader approach [14], and multi-lane communication methods [15], address bandwidth limitations inherent in electrical links. Topology-aware collective algorithms, such as HierKNEM [16] and Rank Reordering [17], aim to reduce link traversals in both intra- and inter-node communication. Approaches focusing on symmetric multiprocessing (SMP) and multi-core clusters [18]-[21], along with holistic optimization for various topologies [22]-[24], have also been investigated.…”
Section: Related Work (mentioning)
confidence: 99%
“…On every step s, with 0 ≤ s < log_2 p, a process with rank r exchanges data in a pairwise fashion with the process of rank r ⊕ 2^s, where ⊕ denotes the bitwise exclusive or. As every process sends all the data it has received so far, the number of blocks doubles at every step, and the cost of Recursive Doubling is given by C_rd = (log_2 p)·α + ((p − 1)/p)·m·β [16]. The algorithm is applicable only when the number of processes is a power of two, which is the only case in which both MPICH and Open MPI employ it.…”
Section: A. Allgather Algorithms (mentioning)
confidence: 99%
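The excerpt above fully specifies the recursive-doubling exchange pattern, so a short sketch can make the cost formula concrete. The following C/MPI program is an illustration only, assuming the number of processes is a power of two and that each rank contributes a single int block; the buffer layout and the use of MPI_Sendrecv are choices of this sketch, not taken from the cited implementations.

/* Recursive-doubling allgather sketch: at step s each rank exchanges the
 * blocks gathered so far with partner r XOR 2^s, so the amount of data
 * doubles every step. Assumes the number of processes p is a power of two. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int r, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    MPI_Comm_size(MPI_COMM_WORLD, &p);    /* assumed to be a power of two */

    int *buf = malloc((size_t)p * sizeof(int));
    buf[r] = r;                           /* each rank contributes one block */

    for (int step = 1; step < p; step <<= 1) {
        int partner = r ^ step;           /* pairwise partner r XOR 2^s */
        /* The contiguous run of blocks gathered so far starts at the
         * step-aligned offset of each rank and is `step` blocks long. */
        int my_off      = (r / step) * step;
        int partner_off = (partner / step) * step;
        MPI_Sendrecv(buf + my_off,      step, MPI_INT, partner, 0,
                     buf + partner_off, step, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    if (r == 0) {                         /* every rank now holds all p blocks */
        for (int i = 0; i < p; i++) printf("%d ", buf[i]);
        printf("\n");
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

With log_2 p steps and (p − 1)/p · m bytes moved per rank in total, the loop above matches the C_rd = (log_2 p)·α + ((p − 1)/p)·m·β cost quoted in the excerpt.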
“…The works of [16] and [6] take a more Allgather-focused approach, using the collective's known communication pattern to create mappings better suited to its algorithms. The first proposes fine-tuned heuristics for Ring, Recursive Doubling, and Binomial broadcast (a possible final component of an Allgather or Broadcast execution), with experimental results showing improvements of up to 78%.…”
Section: Related Work (mentioning)
confidence: 99%
“…There are, however, approaches that do not carry this dependence. The authors in [24], for example, explore four heuristics that perform rank reordering to realize run-time topology awareness for the MPI Allgather primitive. The corresponding approach does not rely on an application profile.…”
Section: Related Work (mentioning)
confidence: 99%
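As a complement to the statement above, the sketch below shows one generic way a run-time rank reordering can be applied in MPI without any application profile: compute a per-process key and build a new communicator ordered by that key with MPI_Comm_split. The key used here (ordering by on-node rank obtained from MPI_Comm_split_type) is a placeholder assumption for illustration, not one of the four heuristics of [24].

/* Generic run-time rank reordering: derive placement information with
 * MPI_Comm_split_type, then let MPI_Comm_split order a new communicator
 * by a computed key. The key below is a placeholder, not a heuristic from
 * the cited work. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Ranks sharing a node end up in the same node_comm (run-time
     * topology information, no profiling required). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Placeholder key: order by on-node rank first (ties broken by world
     * rank), turning a blocked placement into a round-robin one. A
     * topology-aware heuristic would compute this key differently. */
    int key = node_rank;

    MPI_Comm reordered;
    MPI_Comm_split(MPI_COMM_WORLD, 0, key, &reordered);

    int new_rank;
    MPI_Comm_rank(reordered, &new_rank);
    printf("world rank %d -> reordered rank %d\n", world_rank, new_rank);

    /* Collectives such as MPI_Allgather issued on `reordered` now operate
     * on the new rank order. */
    MPI_Comm_free(&reordered);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}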