To deliver scalable performance to large-scale scientific and data-analytics applications, HPC cluster architectures adopt the distributed-memory model. The performance and scalability of parallel applications on such systems are limited by the communication cost across compute nodes. Therefore, projecting the minimum communication cost and maximum scalability of user applications plays a critical role both in assessing the benefits of porting these applications to HPC clusters and in developing efficient distributed-memory implementations. Unfortunately, this task is extremely challenging for end users, as it requires comprehensive knowledge of the target application and hardware architecture and demands significant time and effort for manual system analysis. To streamline the process of porting user applications to HPC clusters, this paper presents CommAnalyzer, an automated framework for estimating the communication cost of the distributed-memory model from sequential code. CommAnalyzer uses novel dynamic program analyses and graph algorithms to capture the inherent flow of program values (information) in sequential code and, from that flow, to estimate the communication incurred when the code is ported to HPC clusters. As a result, CommAnalyzer makes it possible to project the efficiency/scalability upper bound (i.e., Roofline) of an effective distributed-memory implementation before even developing one. Experiments with real-world regular and irregular HPC applications demonstrate the utility of CommAnalyzer in estimating the minimum communication of sequential applications on HPC clusters. In addition, the optimized MPI+X implementations achieve more than 92% of the efficiency upper bound across the different workloads.
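To make the underlying idea concrete, the following is a minimal, hypothetical Python sketch of how a value-flow view of sequential code can yield a communication estimate: a toy trace records which values each iteration of a sequential loop reads and writes, iterations are assigned to processes, and values whose producer and consumer land on different processes are counted as communication. The trace format, the contiguous block partitioning, and all function names are illustrative assumptions and do not reflect CommAnalyzer's actual analyses or interfaces.

```python
# Illustrative sketch only; not CommAnalyzer's algorithm or API.

def simulate_trace(n):
    """Toy 1-D stencil: iteration i reads cells i-1, i, i+1 and writes cell i.
    Returns, per sequential iteration, the sets of values read and written."""
    trace = []
    for i in range(n):
        reads = {j for j in (i - 1, i, i + 1) if 0 <= j < n}
        writes = {i}
        trace.append((reads, writes))
    return trace

def estimate_communication(trace, num_procs):
    """Assign iterations to processes in contiguous blocks, then count values
    whose producing and consuming iterations fall on different processes
    (one transfer per remote consumer). This approximates the data that must
    cross node boundaries under this particular decomposition."""
    n = len(trace)
    owner_of_iter = lambda i: min(i * num_procs // n, num_procs - 1)

    # Map each value to the process owning its writer. In this toy trace each
    # value is written exactly once, so the writer is also the def that
    # reaches every later use.
    producer = {}
    for i, (_, writes) in enumerate(trace):
        for v in writes:
            producer[v] = owner_of_iter(i)

    remote_transfers = 0
    for i, (reads, _) in enumerate(trace):
        consumer = owner_of_iter(i)
        for v in reads:
            if v in producer and producer[v] != consumer:
                remote_transfers += 1
    return remote_transfers

if __name__ == "__main__":
    trace = simulate_trace(1024)
    for p in (2, 4, 8):
        print(p, "processes ->", estimate_communication(trace, p), "remote values")
```

In this toy stencil, only iterations at block boundaries read values produced on a neighboring process, so the estimated communication stays flat as the process count grows; a real analysis would additionally search over decompositions to bound the minimum communication.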
INTRODUCTION

In order to scale to a large number of compute units, HPC cluster architectures adopt the distributed-memory model. Because these architectures lack a single global address space, they are more difficult to program than shared-memory systems and require explicit decomposition and distribution of the program data and computations. The MPI programming model is the de facto standard for programming applications on HPC clusters [6, 18]. MPI uses explicit messaging to exchange data across processes that reside in separate address spaces, and it is often combined with shared-memory