Abstract-Network contention has a significantly adverse effect on the performance of parallel applications with increasing size of parallel machines. Machines of the petascale era are forcing application developers to map tasks intelligently to job partitions to achieve the best performance possible. This paper presents a framework for automated mapping of parallel applications with regular communication graphs to two and three dimensional mesh and torus networks. This framework will save much effort on the part of application developers to generate mappings for their individual applications.One component of the framework is a process topology analyzer to find regular patterns and if found, to determine the dimensions of the communication graphs of applications. The other component is a suite of heuristic techniques for mapping 2D object grids to 2D and 3D processor meshes. The framework chooses the best heuristic from the suite for a given object grid and processor mesh pair based on the hop-bytes metric. We show performance improvements using the framework, for a 2D Stencil benchmark in MPI and the Weather Research and Forecasting model running on the IBM Blue Gene/P. We also compare our algorithms with others discussed in literature.
A low-diameter, fast interconnection network is going to be a prerequisite for building exascale machines. A two-level direct network has been proposed by several groups as a scalable design for future machines. IBM's PERCS topology and the dragonfly network discussed in the DARPA exascale hardware study are examples of this design. The presence of multiple levels in this design leads to hot-spots on a few links when processes are grouped together at the lowest level to minimize total communication volume. This is especially true for communication graphs with a small number of neighbors per task. Routing and mapping choices can impact the communication performance of parallel applications running on a machine with a two-level direct topology. This paper explores intelligent topology aware mappings of different communication patterns to the physical topology to identify cases that minimize link utilization. We also analyze the trade-offs between using direct and indirect routing with different mappings. We use simulations to study communication and overall performance of applications since there are no installations of two-level direct networks yet. This study raises interesting issues regarding the choice of job scheduling, routing and mapping for future machines.
Large parallel machines with hundreds of thousands of processors are becoming more prevalent. Ensuring good load balance is critical for scaling certain classes of parallel applications on even thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with a relatively small amount of memory. Fully distributed load balancing algorithms, on the other hand, tend to take longer to arrive at good solutions. In this paper, we present an automatic dynamic hierarchical load balancing method that overcomes the scalability challenges of centralized schemes and longer running times of traditional distributed schemes. Our solution overcomes these issues by creating multiple levels of load balancing domains which form a tree. This hierarchical method is demonstrated within a measurement-based load balancing framework in CHARM++. We discuss techniques to deal with scalability challenges of load balancing at very large scale. We present performance data of the hierarchical load balancing method on up to 16,384 cores of Ranger (at the Texas Advanced Computing Center) and 65,536 cores of Intrepid (the Blue Gene/P at Argonne National Laboratory) for a synthetic benchmark. We also demonstrate the successful deployment of the method in a scientific application, NAMD, with results on Intrepid.
Abstract-Mapping of parallel applications on the network topology is becoming increasingly important on large supercomputers. Topology aware mapping can reduce the hops traveled by messages on the network and hence reduce contention, which can lead to improved performance. This paper discusses heuristic techniques for mapping applications with irregular communication graphs to mesh and torus topologies. Parallel codes with irregular communication constitute an important class of applications. Unstructured grid applications are a classic example of codes with irregular communication patterns. Since the mapping problem is NP-hard, this paper presents fast heuristic-based algorithms. These heuristics are part of a larger framework for automatic mapping of parallel applications. We evaluate the heuristics in this paper in terms of the reduction in average hops per byte. The heuristics discussed here are applicable to most parallel applications since irregular graphs constitute the most general category of communication patterns. Some heuristics can also be easily extended to other network topologies.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.