Model-based selection of optimal MPI broadcast algorithms for multi-core clusters

Nuriyev, Emin; Rico‐Gallego, Juan‐Antonio; Lastovetsky, Alexey

doi:10.1016/j.jpdc.2022.03.012

Cited by 9 publications

(5 citation statements)

References 37 publications


“…However, by having a local pointer to the beginning of this shared memory area, any process on the same node can independently access the broadcast data. Since the size of the broadcast message remains consistent with that of the pure MPI broadcast as shown in figure 3, performing the across-node broadcast operation across all the roots becomes straightforward in this scenario [13,24]. Here we just start the passive target synchronization epoch for all processes and assign the data to all processes locations as shown in figure 4, corresponding to the window win.…”

Section: RMA Broadcast (mentioning)

confidence: 99%

“…At the first round in figure 6 process 0 put the buffer b in the memory of processes 1 and 2, at the second round process 1 put the buffer in the memory of processes 3 and 4, for the third round process 2 puts the message in the buffer of 5 and 6. Finally, at the fourth round, process 3 puts the message in the memory of process 7 [13]. The height of the binary tree is equal to T_Total = T log2(p) at each round the maximum number is 2^i for i is the round number.…”

Section: Binary Tree Algorithm (mentioning)

confidence: 99%
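
The put order described in the statement above can be sketched as a small simulation. This is only an illustration, not the citing paper's implementation. It assumes the usual array layout of a binary tree, where rank r has children 2r+1 and 2r+2. The function names `binary_tree_schedule`, `simulate_broadcast` and `depth` are hypothetical, and no MPI calls are made.

```python
def binary_tree_schedule(p):
    """List (sender, receivers) pairs in the order the senders act.

    Assumption: rank r forwards to ranks 2r+1 and 2r+2 when they exist.
    """
    schedule = []
    for parent in range(p):
        children = [c for c in (2 * parent + 1, 2 * parent + 2) if c < p]
        if children:
            schedule.append((parent, children))
    return schedule


def simulate_broadcast(p, message):
    """Copy `message` from rank 0 into every rank's buffer along the tree."""
    buffers = [None] * p
    buffers[0] = message
    for parent, children in binary_tree_schedule(p):
        # A sender must already hold the data before it forwards it.
        assert buffers[parent] is not None
        for child in children:
            buffers[child] = buffers[parent]
    return buffers


def depth(rank):
    """Number of tree edges between rank 0 and `rank`."""
    return (rank + 1).bit_length() - 1


if __name__ == "__main__":
    for sender, receivers in binary_tree_schedule(8):
        print(sender, "->", receivers)
    print(simulate_broadcast(8, "b"))
```

For eight processes the schedule lists senders 0, 1, 2 and 3 in that order, matching the four steps in the quoted text. Ranks 1 and 2 both hold the data after the first step, so an implementation may let them forward concurrently, and rank 7 sits at depth 3 = log2(8).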

“…There are two fundamental drawbacks of broadcasting down a binomial tree. First, when the communicator size is not a correct power of two, the communication time is obviously out of balance [13]. To finish the task, the final step (for instance, rank 7 in figure 8) requires log2(p) communication rounds.…”

Section: Binomial Tree Algorithm (mentioning)

confidence: 99%
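
The imbalance noted in the statement above can be seen in a short sketch of a binomial-tree schedule. It assumes the distance-doubling ordering, in which every rank that already holds the data sends to the rank `step` positions away and `step` doubles each round. Real MPI libraries may order the sends differently, and the helper name `binomial_rounds` is hypothetical.

```python
def binomial_rounds(p):
    """Return one list of (sender, receiver) pairs per communication round.

    Assumption: distance-doubling order. In round k the ranks 0..2^k - 1
    already hold the data and each sends to the rank 2^k positions away.
    """
    rounds = []
    step = 1
    while step < p:
        rounds.append([(r, r + step) for r in range(step) if r + step < p])
        step *= 2
    return rounds


if __name__ == "__main__":
    for p in (8, 6):
        print("p =", p)
        for k, pairs in enumerate(binomial_rounds(p)):
            print("  round", k, pairs)
```

With p = 8 every holder sends in each of the three rounds, and rank 7 is reached only in the last one. With p = 6 the last round has two sends while ranks 2 and 3 sit idle, which is the imbalance for communicator sizes that are not a power of two.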

“…In essence, PGAS languages represent a novel approach to parallel programming, streamlining the process by automating communication between processes. However, this automation places a greater onus on the compiler to optimize the code for efficient execution [13].…”

Section: Introduction (mentioning)

confidence: 99%


Enhancing MPI remote memory access model for distributed-memory systems through one-sided broadcast implementation

Abuelsoud, Kogutenko, Naveen (2024), J. Phys.: Conf. Ser.


Efficiently processing vast and expanding data volumes is a pressing challenge. Traditional high-performance computers, utilizing distributed-memory architecture and a message-passing model, grapple with synchronization issues, hampering their ability to keep up with the growing demands. Remote Memory Access (RMA), often referred to as one-sided MPI communications, offers a solution by allowing a process to directly access another process’s memory, eliminating the need for message exchange and significantly boosting performance. Unfortunately, the existing MPI RMA standard lacks a collective operation interface, limiting efficiency. To overcome this constraint, we introduce an algorithm design that enables efficient parallelizable collective operations within the RMA framework. Our study focuses primarily on the advantages of collective operations, using the broadcast algorithm as a case study. Our implementations surpass traditional methods, highlighting the promising potential of this technique, as indicated by initial performance tests.


“…In addition, the complexity of peer-to-peer communication becomes O(n); however, when utilizing the self-generation concept, the data to be transmitted in the forward step is 0, and this is replaced by broadcast set communication in the synchronization step. In terms of the broadcast process, the complexity can be reduced to O(log n) using tree algorithms [26,27]. Consequently, due to the optimized design of the prediction time and the reduced communication time, the proposed PPRN method can accelerate the pipeline with much less overhead than CPPipe.…”

Section: Reduced Network Communications (mentioning)

confidence: 99%

Pipeline Parallelism with Reduced Network Communications for Efficient Compute-intensive Neural Network Training

Yu, Park (2023), Preprint


Pipeline parallelism is a distributed deep neural network training method suitable for tasks that consume large amounts of memory. However, this method entails a large amount of overhead because of the dependency between devices in performing forward and backward steps through multiple devices. A method to remove forward step dependency through the all-to-all approach has been proposed for the compute-intensive models; however, this method incurs large overhead when training with a large number of devices and is inefficient in terms of weight memory consumption. Therefore, we propose a pipeline parallelism method that reduces network communication using a self-generation concept and simultaneously reduces overhead by minimizing the weight memory used for acceleration. In a Darknet53 training throughput experiment using six devices, the proposed method showed excellent performance of approximately 63.7% compared to the baseline by reduced overhead and communication costs and showed less memory consumption of approximately 17.0%.


Algorithm Selection of MPI Collectives Considering System Utilization

Salimi Beni, Hunold, Cosenza (2024), Lecture Notes in Computer Science
