Abstract: In this report we describe how to reduce the communication time of parallel MPI applications using a library that monitors MPI applications and supports introspection (the program itself can query the state of the monitoring system). Building on previous work, this library can observe how collective communications are decomposed into point-to-point messages. It also features monitoring sessions that allow suspending and restarting the monitoring, limiting it to specific portions of the code.
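The session-based monitoring pattern described above can be illustrated with a minimal sketch. This is plain Python with hypothetical names (the actual library is an MPI tool; only the suspend/restart and introspection behavior is mirrored here):

```python
class MonitoringSession:
    """Toy model of suspendable message monitoring with introspection."""

    def __init__(self):
        self.active = False
        self.p2p_counts = {}  # (src, dst) -> number of point-to-point messages

    def start(self):
        self.active = True

    def suspend(self):
        self.active = False

    def record_p2p(self, src, dst):
        # Only count traffic while the session is active, so monitoring
        # can be limited to specific portions of the code.
        if self.active:
            key = (src, dst)
            self.p2p_counts[key] = self.p2p_counts.get(key, 0) + 1

    def query(self, src, dst):
        # Introspection: the application itself can ask for current counts.
        return self.p2p_counts.get((src, dst), 0)


session = MonitoringSession()
session.start()
session.record_p2p(0, 1)
session.suspend()
session.record_p2p(0, 1)    # ignored: monitoring is suspended
print(session.query(0, 1))  # -> 1
```

The key point is that counting and querying live in the same process, so the application can adapt at runtime based on what the monitor has seen so far.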
“…There have been many efforts to optimize MPI communication. For example, MPI point-to-point communication routines can be optimized by using more efficient primitives [18], or through the use of a library for monitoring MPI applications [19]. MPI collective communications can be optimized over wide-area networks by considering network details [20], or through a library like HPC-X [21] for offloading.…”
The Message Passing Interface (MPI) is a crucial programming tool for enabling communication between processes in parallel applications. MPI users aim to allocate tasks to processors in a way that maximizes both spatial and temporal locality in the network. However, this can be challenging, especially in large-scale networks where maximizing processor locality may not be feasible at runtime. To address this issue, we propose Hamorder, an offline node-reassignment approach that takes physical processor locations into account, based on graph reordering for Random network topologies. Hamorder aims to optimize task mapping for improved performance in parallel applications, whether across multiple tasks or within a single task. Additionally, we investigate the potential of improving MPI applications through runtime parameter tuning based on Hamorder. Our evaluation shows that on Random topologies Hamorder provides a 27.3% performance improvement over the Gorder algorithm, a state-of-the-art solution that enhances cache locality by rearranging the vertices of a graph so that vertices typically accessed together are placed in close proximity. Moreover, our autotuning framework using Hamorder achieves an average speedup of 1.38x for the targeted MPI applications by searching through various runtime parameter combinations.
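Neither Hamorder nor Gorder is reproduced here, but the underlying idea of locality-oriented graph reordering can be sketched with a much simpler heuristic: relabel vertices in BFS order, so that vertices reached together receive nearby IDs.

```python
from collections import deque


def bfs_reorder(adj, start=0):
    """Relabel vertices in BFS order so neighbors get nearby IDs.

    A deliberately simple stand-in for locality-oriented reorderings
    such as Gorder/Hamorder. `adj` maps each vertex to its neighbor list.
    Returns a dict: old vertex ID -> new (BFS-position) ID.
    """
    order, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return {v: i for i, v in enumerate(order)}


# A small path graph whose original labels scatter adjacent vertices:
# 0 - 3 - 1 - 2. After reordering, IDs follow the path.
adj = {0: [3], 3: [0, 1], 1: [3, 2], 2: [1]}
print(bfs_reorder(adj))  # -> {0: 0, 3: 1, 1: 2, 2: 3}
```

After relabeling, vertices that are adjacent in the graph have consecutive IDs, which is the basic locality effect the reordering algorithms above pursue with far more sophisticated cost models.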
“…Therefore, it is useful to enforce the mapping at runtime. For instance, if we see that in an application rank i and rank j communicate a lot, it is better to reorder the ranks such that the processes of rank i and j are close in the topology [30]. This might require exchanging some data.…”
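The rank-reordering idea in the snippet above can be sketched as a toy greedy heuristic (not any specific published algorithm): given per-pair communication volumes, place the heaviest-communicating ranks on adjacent slots first.

```python
def greedy_remap(comm, n):
    """Greedily place ranks so heavy communicators sit on adjacent slots.

    `comm[(i, j)]` is the message volume between ranks i and j; `n` is
    the total number of ranks. Pairs are processed by decreasing volume,
    so the heaviest pair lands on neighboring slots. Returns a dict:
    rank -> slot.
    """
    pairs = sorted(comm.items(), key=lambda kv: -kv[1])
    placement, slot = {}, 0
    for (i, j), _vol in pairs:
        for r in (i, j):
            if r not in placement:
                placement[r] = slot
                slot += 1
    # Ranks that never communicate take the remaining slots.
    for r in range(n):
        if r not in placement:
            placement[r] = slot
            slot += 1
    return placement


comm = {(0, 3): 100, (1, 2): 10}
print(greedy_remap(comm, 4))  # -> {0: 0, 3: 1, 1: 2, 2: 3}
```

Here ranks 0 and 3, which exchange the most data, end up on slots 0 and 1; a real scheme would also weigh topology distances, which this sketch ignores.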
“…Both proposals show improvements over default forms of mapping; however, they require profiling information. In turn, the work of [22] proposes online monitoring and rank remapping that provide improvements without needing prior executions; however, it still requires active modification of the application code. Sparbit could potentially be coupled with these techniques, but its main advantage in comparison is that it works out of the box, providing significant improvements in communication time on theoretically any hierarchical network, without needing topology information, additional communication, or computation.…”
Collective operations are considered critical for improving the performance of exascale-ready and high-performance computing applications. In this paper we focus on the Message-Passing Interface (MPI) Allgather many-to-many collective, which is among the most frequently called and time-consuming operations. Each MPI algorithm for this call suffers from different operational and performance limitations, which may include working only in restricted cases; requiring a number of communication steps that grows linearly with the number of processes; memory copies and shifts to ensure correct data organization; and non-local data-exchange patterns, most of which add to the total operation time. These characteristics mean that no single algorithm is best for all cases, which in turn implies that the algorithm used to execute the call must be chosen carefully. Considering these aspects, we propose the Stripe Parallel Binomial Trees (Sparbit) algorithm, which has optimal latency and bandwidth time costs and no usage restrictions. It also maintains a much more local communication pattern that minimizes delays due to long-range exchanges, allowing more performance to be extracted from current systems compared with asymptotically equivalent alternatives. In its best scenario, Sparbit surpassed the traditional MPI algorithms in 46.43% of test cases, with mean (median) improvements of 34.7% (26.16%) and a maximum of 84.16%.
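Sparbit itself is not reproduced here, but a classic baseline it is compared against, recursive-doubling Allgather, can be simulated in a few lines: with a power-of-two number of ranks, each of log2(p) steps pairs rank i with rank i XOR 2^k, and the data each rank holds doubles per step.

```python
def allgather_recursive_doubling(values):
    """Simulate a recursive-doubling Allgather over len(values) ranks.

    Each rank starts with one block; after log2(p) pairwise exchange
    steps every rank holds all p blocks. Power-of-two rank counts only.
    This illustrates a classic baseline algorithm, not Sparbit.
    """
    p = len(values)
    assert p & (p - 1) == 0, "power-of-two number of ranks required"
    # buf[i] holds the blocks rank i currently knows: {source rank: value}
    buf = [{i: v} for i, v in enumerate(values)]
    step = 1
    while step < p:
        new = [dict(b) for b in buf]
        for i in range(p):
            partner = i ^ step          # pairwise exchange partner
            new[i].update(buf[partner])  # receive everything partner knows
        buf = new
        step *= 2
    return [[b[r] for r in range(p)] for b in buf]


# 4 ranks, 2 steps; every rank ends with all four blocks in rank order.
print(allgather_recursive_doubling(["a", "b", "c", "d"]))
```

Note the non-local exchange in the last step (distance p/2 in rank space); minimizing such long-range traffic is exactly the kind of limitation the more local communication pattern described above targets.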