Abstract: In this report we describe how to reduce the communication time of parallel MPI applications using a library that monitors MPI applications and supports introspection (the program itself can query the state of the monitoring system). Building on previous work, this library can observe how collective communications are decomposed into point-to-point messages. It also features monitoring sessions that allow suspending and restarting the monitoring, limiting it to specific portions of the code.
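The session-based monitoring pattern described above can be illustrated with a minimal sketch. This is plain Python with hypothetical names (the actual library is an MPI tool; only the suspend/restart and introspection behavior is mirrored here):

```python
class MonitoringSession:
    """Toy model of suspendable message monitoring with introspection."""

    def __init__(self):
        self.active = False
        self.p2p_counts = {}  # (src, dst) -> number of point-to-point messages

    def start(self):
        self.active = True

    def suspend(self):
        self.active = False

    def record_p2p(self, src, dst):
        # Only count traffic while the session is active, so monitoring
        # can be limited to specific portions of the code.
        if self.active:
            key = (src, dst)
            self.p2p_counts[key] = self.p2p_counts.get(key, 0) + 1

    def query(self, src, dst):
        # Introspection: the application itself can ask for current counts.
        return self.p2p_counts.get((src, dst), 0)


session = MonitoringSession()
session.start()
session.record_p2p(0, 1)
session.suspend()
session.record_p2p(0, 1)    # ignored: monitoring is suspended
print(session.query(0, 1))  # -> 1
```

The key point is that counting and querying live in the same process, so the application can adapt at runtime based on what the monitor has seen so far.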
“…There have been many efforts to optimize MPI communication. For example, MPI point-to-point communication routines can be optimized by using more efficient primitives [18], or through the use of a library for monitoring MPI applications [19]. MPI collective communications can be optimized over wide-area networks by considering network details [20], or through a library like HPC-X [21] for offloading.…”
The Message Passing Interface (MPI) is a crucial programming tool for enabling communication between processes in parallel applications. MPI users aim to allocate tasks to processors in a way that maximizes both spatial and temporal locality in the network. However, this can be challenging, especially in large-scale networks where maximizing processor locality may not be feasible at runtime. To address this issue, we propose Hamorder, an offline node-reassignment approach that takes physical processor locations into account, based on graph reordering for Random network topologies. Hamorder aims to optimize task mapping for improved performance in parallel applications, whether across multiple tasks or within a single task. Additionally, we investigate the potential of improving MPI applications through runtime parameter tuning based on Hamorder. Our evaluation shows that on Random topologies Hamorder provides a 27.3% performance improvement over the Gorder algorithm, a state-of-the-art solution that enhances cache locality by rearranging the vertices of a graph so that vertices typically accessed together are placed in close proximity. Moreover, our autotuning framework using Hamorder achieves an average speedup of 1.38x for the targeted MPI applications by searching through various runtime parameter combinations.
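Neither Hamorder nor Gorder is reproduced here, but the underlying idea of locality-oriented graph reordering can be sketched with a much simpler heuristic: relabel vertices in BFS order, so that vertices reached together receive nearby IDs.

```python
from collections import deque


def bfs_reorder(adj, start=0):
    """Relabel vertices in BFS order so neighbors get nearby IDs.

    A deliberately simple stand-in for locality-oriented reorderings
    such as Gorder/Hamorder. `adj` maps each vertex to its neighbor list.
    Returns a dict: old vertex ID -> new (BFS-position) ID.
    """
    order, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return {v: i for i, v in enumerate(order)}


# A small path graph whose original labels scatter adjacent vertices:
# 0 - 3 - 1 - 2. After reordering, IDs follow the path.
adj = {0: [3], 3: [0, 1], 1: [3, 2], 2: [1]}
print(bfs_reorder(adj))  # -> {0: 0, 3: 1, 1: 2, 2: 3}
```

After relabeling, vertices that are adjacent in the graph have consecutive IDs, which is the basic locality effect the reordering algorithms above pursue with far more sophisticated cost models.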
“…Therefore, it is useful to enforce the mapping at runtime. For instance, if we see that in an application rank i and rank j communicate a lot, it is better to reorder the ranks such that the processes of rank i and j are close in the topology [30]. This might require exchanging some data.…”
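The rank-reordering idea in the snippet above can be sketched as a toy greedy heuristic (not any specific published algorithm): given per-pair communication volumes, place the heaviest-communicating ranks on adjacent slots first.

```python
def greedy_remap(comm, n):
    """Greedily place ranks so heavy communicators sit on adjacent slots.

    `comm[(i, j)]` is the message volume between ranks i and j; `n` is
    the total number of ranks. Pairs are processed by decreasing volume,
    so the heaviest pair lands on neighboring slots. Returns a dict:
    rank -> slot.
    """
    pairs = sorted(comm.items(), key=lambda kv: -kv[1])
    placement, slot = {}, 0
    for (i, j), _vol in pairs:
        for r in (i, j):
            if r not in placement:
                placement[r] = slot
                slot += 1
    # Ranks that never communicate take the remaining slots.
    for r in range(n):
        if r not in placement:
            placement[r] = slot
            slot += 1
    return placement


comm = {(0, 3): 100, (1, 2): 10}
print(greedy_remap(comm, 4))  # -> {0: 0, 3: 1, 1: 2, 2: 3}
```

Here ranks 0 and 3, which exchange the most data, end up on slots 0 and 1; a real scheme would also weigh topology distances, which this sketch ignores.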
“…Both proposals show improvements over default forms of mapping; however, they require profiling information. In turn, the work of [22] proposes online monitoring and rank remapping that provide improvements without needing prior executions; however, it still requires active modification of the application code. Sparbit could potentially be coupled with these techniques, but its main advantage in comparison is that it works out of the box, providing significant improvements in communication time on theoretically any hierarchical network, without needing topology information, additional communication, or computation.…”
Collective operations are considered critical for improving the performance of exascale-ready and high-performance computing applications. In this paper we focus on the Message-Passing Interface (MPI) Allgather many-to-many collective, which is among the most frequently called and time-consuming operations. Each MPI algorithm for this call suffers from different operational and performance limitations, which may include working only in restricted cases; requiring a number of communication steps that grows linearly with the number of processes; memory copies and shifts to ensure correct data organization; and non-local data-exchange patterns, most of which add to the total operation time. These characteristics mean that no single algorithm is best for all cases, which in turn implies that the algorithm used to execute the call must be chosen carefully. Considering these aspects, we propose the Stripe Parallel Binomial Trees (Sparbit) algorithm, which has optimal latency and bandwidth time costs and no usage restrictions. It also maintains a much more local communication pattern that minimizes delays due to long-range exchanges, allowing more performance to be extracted from current systems compared with asymptotically equivalent alternatives. In its best scenario, Sparbit surpassed the traditional MPI algorithms in 46.43% of test cases, with mean (median) improvements of 34.7% (26.16%) and a maximum of 84.16%.
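Sparbit itself is not reproduced here, but a classic baseline it is compared against, recursive-doubling Allgather, can be simulated in a few lines: with a power-of-two number of ranks, each of log2(p) steps pairs rank i with rank i XOR 2^k, and the data each rank holds doubles per step.

```python
def allgather_recursive_doubling(values):
    """Simulate a recursive-doubling Allgather over len(values) ranks.

    Each rank starts with one block; after log2(p) pairwise exchange
    steps every rank holds all p blocks. Power-of-two rank counts only.
    This illustrates a classic baseline algorithm, not Sparbit.
    """
    p = len(values)
    assert p & (p - 1) == 0, "power-of-two number of ranks required"
    # buf[i] holds the blocks rank i currently knows: {source rank: value}
    buf = [{i: v} for i, v in enumerate(values)]
    step = 1
    while step < p:
        new = [dict(b) for b in buf]
        for i in range(p):
            partner = i ^ step          # pairwise exchange partner
            new[i].update(buf[partner])  # receive everything partner knows
        buf = new
        step *= 2
    return [[b[r] for r in range(p)] for b in buf]


# 4 ranks, 2 steps; every rank ends with all four blocks in rank order.
print(allgather_recursive_doubling(["a", "b", "c", "d"]))
```

Note the non-local exchange in the last step (distance p/2 in rank space); minimizing such long-range traffic is exactly the kind of limitation the more local communication pattern described above targets.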