The advent of petascale computing has introduced new challenges (e.g., heterogeneity, system failure) for programming scalable parallel applications. The increased complexity and dynamism of today's science and engineering applications have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this paper, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real-world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to the development of applications that scale irrespective of the rough landscape of supercomputing technology. The empirical evaluation presented in this paper spans many mini-applications and real applications executed on modern supercomputers, including Blue Gene/Q, Cray XE6, and Stampede.
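To make migratability concrete: the essential requirement is that each work unit be an object whose entire state can be packed and unpacked, so the runtime can move it between processors as load and system conditions change. The sketch below is a schematic Python illustration of this idea (using pickle in place of CHARM++'s actual PUP serialization framework), not real CHARM++ code.

```python
# Schematic sketch (not CHARM++ API): migratability means work units
# are objects whose state can be packed, shipped to another processor,
# and unpacked, letting the runtime rebalance them transparently.
import pickle

class MigratableChunk:
    """A unit of over-decomposed work with serializable state."""
    def __init__(self, chunk_id, data):
        self.chunk_id = chunk_id
        self.data = data              # application state, e.g. a subdomain

    def pack(self):
        # PUP-style serialization: everything needed to resume elsewhere.
        return pickle.dumps({"chunk_id": self.chunk_id, "data": self.data})

    @staticmethod
    def unpack(blob):
        state = pickle.loads(blob)
        return MigratableChunk(state["chunk_id"], state["data"])

# A runtime that introspects per-object load can move an overloaded
# chunk: serialize on the source node, deserialize on the destination.
chunk = MigratableChunk(7, data=[1.0, 2.0, 3.0])
moved = MigratableChunk.unpack(chunk.pack())   # the "migrated" copy
assert moved.chunk_id == 7 and moved.data == chunk.data
```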
A low-diameter, fast interconnection network will be a prerequisite for building exascale machines. A two-level direct network has been proposed by several groups as a scalable design for future machines; IBM's PERCS topology and the dragonfly network discussed in the DARPA exascale hardware study are examples of this design. The presence of multiple levels in this design leads to hot-spots on a few links when processes are grouped together at the lowest level to minimize total communication volume. This is especially true for communication graphs with a small number of neighbors per task. Routing and mapping choices can impact the communication performance of parallel applications running on a machine with a two-level direct topology. This paper explores intelligent, topology-aware mappings of different communication patterns to the physical topology to identify cases that minimize link utilization. We also analyze the trade-offs between direct and indirect routing under different mappings. Since there are no installations of two-level direct networks yet, we use simulations to study the communication and overall performance of applications. This study raises interesting issues regarding the choice of job scheduling, routing, and mapping for future machines.
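To make the hot-spot effect concrete, the following sketch counts messages per inter-group link for a 2-D near-neighbor pattern under a blocked and a random placement. The stencil size, group count, and the assumption of one direct link per pair of groups (i.e., direct routing) are illustrative simplifications, not PERCS or dragonfly parameters: blocking minimizes total inter-group traffic but funnels it onto a few links, while random placement sends more total traffic but spreads it.

```python
# Illustrative sketch: inter-group link loads on a two-level direct
# network under direct routing, where all traffic between two groups
# shares the single direct link connecting them.
from collections import Counter
import random

N = 32               # tasks form an N x N near-neighbor (stencil) pattern
NUM_GROUPS = 16      # lowest-level groups; one direct link per group pair
PER_GROUP = N * N // NUM_GROUPS

random.seed(0)
perm = list(range(N * N))
random.shuffle(perm)                     # random placement, for contrast

def group_of(task, mapping):
    t = task if mapping == "blocked" else perm[task]
    return t // PER_GROUP

def intergroup_link_loads(mapping):
    loads = Counter()
    for r in range(N):
        for c in range(N):
            t = r * N + c
            for nbr in ([(r + 1) * N + c] if r + 1 < N else []) + \
                       ([t + 1] if c + 1 < N else []):
                g1, g2 = group_of(t, mapping), group_of(nbr, mapping)
                if g1 != g2:             # message crosses the second level
                    loads[frozenset((g1, g2))] += 1
    return loads

for m in ("blocked", "random"):
    loads = intergroup_link_loads(m)
    print(f"{m:8s} links used: {len(loads):3d}  "
          f"total messages: {sum(loads.values()):4d}  "
          f"hottest link: {max(loads.values())}")
```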
The performance of applications executed on large parallel systems suffers due to load imbalance, so load balancing is required to scale such applications to large systems. However, load balancing incurs a cost that may not be known a priori. In addition, application characteristics may change due to the application's dynamic nature and the parallel system used for execution. As a result, deciding when to balance the load to obtain the best performance is challenging. Existing approaches put this burden on the users, who rely on educated guesses and extrapolation techniques to choose a reasonable load-balancing period, which may be neither feasible nor efficient. In this paper, we propose the Meta-Balancer framework, which relieves application programmers of deciding when to balance load. By continuously monitoring application characteristics and using a set of guiding principles, Meta-Balancer invokes load balancing on its own, without any prior application knowledge. We demonstrate that Meta-Balancer improves or matches the best performance that can be obtained by fine-tuning periodic load balancing. We also show that in some cases Meta-Balancer improves performance by 18% where periodic load balancing gives only a 1.5% benefit.
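The guiding principle can be sketched as a cost-benefit test: invoke the balancer once the time expected to be saved over the remaining iterations exceeds the measured cost of a load-balancing step. The condition below is an illustrative Python sketch of this reasoning, not Meta-Balancer's actual algorithm.

```python
# Illustrative adaptive load-balancing trigger (assumed logic, not the
# actual Meta-Balancer implementation): balance when the projected
# savings from balancing amortize its measured cost.
def should_balance(iter_times_per_pe, lb_cost, horizon):
    """
    iter_times_per_pe : per-PE times for the last iteration (seconds)
    lb_cost           : measured cost of one load-balancing step (seconds)
    horizon           : iterations the balanced state is expected to last
    """
    t_max = max(iter_times_per_pe)        # the slowest PE sets the pace
    t_avg = sum(iter_times_per_pe) / len(iter_times_per_pe)
    savings_per_iter = t_max - t_avg      # ideal gain if perfectly balanced
    return savings_per_iter * horizon > lb_cost

# Example: 4 PEs with one badly overloaded; balancing costs 0.5 s and
# about 100 similar iterations remain.
print(should_balance([0.10, 0.11, 0.10, 0.25], lb_cost=0.5, horizon=100))
# -> True: ~0.11 s saved per iteration quickly amortizes the 0.5 s cost
```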
Currently deployed petascale supercomputers typically use toroidal network topologies in three or more dimensions. While these networks perform well for topology-agnostic codes on a few thousand nodes, leadership machines with 20,000 nodes require topology awareness to avoid network contention for communication-intensive codes. Topology adaptation is complicated by irregular node-allocation shapes and holes due to dedicated input/output nodes or hardware failure. In the context of the popular molecular dynamics program NAMD, we present methods for mapping a periodic 3-D grid of fixed-size spatial decomposition domains to 3-D Cray Gemini and 5-D IBM Blue Gene/Q toroidal networks to enable hundred-million-atom, full-machine simulations, and for similarly partitioning node allocations into compact domains for smaller simulations using multiple-copy algorithms. Additional enabling techniques are discussed, and performance is reported for NCSA Blue Waters, ORNL Titan, ANL Mira, TACC Stampede, and NERSC Edison.
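The core of such a mapping can be sketched as dimension-wise scaling of the patch grid onto the torus, so that neighboring patches land on nearby nodes. The Python sketch below (with made-up sizes, simplified to a 3-D torus of matching dimensionality) is schematic and not NAMD's actual mapping code.

```python
# Schematic sketch: place a periodic Px x Py x Pz grid of spatial
# decomposition patches onto a Tx x Ty x Tz torus by scaling each grid
# dimension onto the matching torus dimension, so neighbor messages
# travel few hops.
def patch_to_node(i, j, k, patch_dims, torus_dims):
    (px, py, pz), (tx, ty, tz) = patch_dims, torus_dims
    return (i * tx // px, j * ty // py, k * tz // pz)

def torus_hops(a, b, torus_dims):
    """Hop count between nodes a and b, accounting for wrap-around links."""
    return sum(min(abs(x - y), t - abs(x - y))
               for x, y, t in zip(a, b, torus_dims))

patches, torus = (16, 16, 16), (8, 8, 8)          # made-up sizes
n1 = patch_to_node(5, 5, 5, patches, torus)
n2 = patch_to_node(6, 5, 5, patches, torus)       # the +x neighbor patch
print(n1, n2, "hops:", torus_hops(n1, n2, torus)) # neighbors stay close
```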
Task mapping on torus networks has traditionally focused on reducing either the maximum dilation or the average number of hops per byte for messages in an application. These metrics make simplified assumptions about the causes of network congestion and do not correlate accurately with execution time. Hence, they cannot be used to reliably predict or compare application performance under different mappings. In this paper, we attempt to model the performance of an application using communication data, such as the communication graph and network hardware counters. We use supervised learning algorithms, such as randomized decision trees, to correlate performance with prior and new metrics. We propose new hybrid metrics that provide high correlation with application performance and may be useful for accurate performance prediction. For three different communication patterns and a production application, we demonstrate a very strong correlation between the proposed metrics and the execution time of these codes.
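For reference, both traditional metrics can be computed directly from the communication graph and the placement. The sketch below (with a made-up torus size and traffic) shows maximum dilation and average hops per byte for one mapping.

```python
# Illustrative sketch of the two classical mapping metrics; the torus
# size, placement, and traffic volumes are made-up inputs.
def torus_hops(a, b, dims):
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

def mapping_metrics(comm_graph, placement, dims):
    """
    comm_graph : list of (src_task, dst_task, bytes)
    placement  : dict mapping task -> torus coordinates
    Returns (maximum dilation, average hops per byte).
    """
    total_hop_bytes = total_bytes = max_dilation = 0
    for src, dst, nbytes in comm_graph:
        hops = torus_hops(placement[src], placement[dst], dims)
        total_hop_bytes += hops * nbytes
        total_bytes += nbytes
        max_dilation = max(max_dilation, hops)
    return max_dilation, total_hop_bytes / total_bytes

dims = (4, 4, 4)
placement = {0: (0, 0, 0), 1: (1, 0, 0), 2: (3, 2, 1)}
graph = [(0, 1, 1 << 20), (0, 2, 1 << 16)]
print(mapping_metrics(graph, placement, dims))  # -> (4, ~1.18)
```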
Network congestion is one of the primary causes of performance degradation, performance variability, and poor scaling in communication-heavy parallel applications. However, the causes and mechanisms of network congestion on modern interconnection networks are not well understood, and new approaches are needed to analyze, model, and predict this critical behavior in order to improve the performance of large-scale parallel applications. This paper applies supervised learning algorithms, such as forests of extremely randomized trees and gradient-boosted regression trees, to perform regression analysis on communication data and application execution time. Using data derived from multiple executions, we create models to predict the execution time of communication-heavy parallel applications. This analysis also identifies the features, and the associated hardware components, that have the most impact on network congestion and, in turn, on execution time. The ideas presented in this paper have wide applicability: predicting the execution time on a different number of cores, on different input datasets, or even for an unknown code; identifying the best configuration parameters for an application; and finding the root causes of network congestion on different architectures.

I. MOTIVATION AND IMPACT

Network congestion is widely recognized as one of the primary causes of performance degradation, performance variability, and poor scaling in communication-heavy applications running on supercomputers [5]. However, due to the complex nature of interconnection networks, as well as message injection and routing strategies, network congestion and its root causes in network resources and hardware components are not well understood. This makes the problem of mitigating and avoiding network congestion difficult, and it complicates the task of writing congestion-avoiding and congestion-minimizing algorithms for communication and task mapping. Therefore, we need new approaches to understand and model network congestion in order to improve the performance of large-scale parallel applications.

When a message is sent from one node to another, it is split into packets that pass through many resources and hardware components on the network. A packet starts in an injection FIFO on the source node. It then passes through multiple network links and receive buffers on intermediate nodes before it finally lands in the reception FIFO on the destination. When shared by multiple packets, any or all of these network components can slow down individual flits, packets, and messages. This paper aims to identify the hardware components that most affect the performance of sending a message.

Our approach uses supervised machine learning to build models that map from independent variables, representing different network hardware components, to a dependent variable: the execution time of the application. We only consider computationally balanced, communication-heavy parallel applications and hence focus on the communication fraction of the total execution time.
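A minimal sketch of this regression pipeline, using scikit-learn with illustrative (assumed) feature names and synthetic data, shows the two model families the paper names and how feature importances point at the dominant hardware component.

```python
# Illustrative sketch: regress execution time on network-counter
# features with forests of extremely randomized trees and gradient
# boosted regression trees, then rank features by importance. The
# feature names and data are synthetic stand-ins for real counters.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor

rng = np.random.default_rng(0)
features = ["avg_link_load", "max_recv_buffer_occupancy",
            "injection_fifo_length", "avg_dilation"]
X = rng.random((200, len(features)))
# Synthetic ground truth: receive-buffer occupancy dominates here.
y = 3.0 * X[:, 1] + 1.0 * X[:, 0] + 0.1 * rng.standard_normal(200)

for model in (ExtraTreesRegressor(n_estimators=200, random_state=0),
              GradientBoostingRegressor(random_state=0)):
    model.fit(X[:150], y[:150])
    r2 = model.score(X[150:], y[150:])          # held-out prediction quality
    ranked = sorted(zip(model.feature_importances_, features), reverse=True)
    print(type(model).__name__, f"R^2={r2:.2f}",
          "top feature:", ranked[0][1])
```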