CHANGA is an N-body cosmology simulation application implemented using CHARM++. In this paper, we present the parallel design of CHANGA and address many challenges arising due to the high dynamic ranges of clustered datasets. We propose optimizations based on adaptive techniques. We evaluate the performance of CHANGA on highly clustered datasets: a z ∼ 0 snapshot of a 2 billion particle realization of a 25 Mpc volume, and a 52 million particle multi-resolution realization of a dwarf galaxy. For the 25 Mpc volume, we show strong scaling on up to 128K cores of Blue Waters. We also demonstrate scaling up to 128K cores of a multi-stepping run of the 2 billion particle simulation. While the scaling of the multi-stepping run is not as good as single stepping, the throughput at 128K cores is greater by a factor of 2. We also demonstrate strong scaling on up to 512K cores of Blue Waters for two large, uniform datasets with 12 and 24 billion particles.
The advent of petascale computing has introduced new challenges (e.g. heterogeneity, system failure) for programming scalable parallel applications. Increased complexity and dynamism in science and engineering applications of today have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this paper, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to development of applications that scale irrespective of the rough landscape of supercomputing technology. Empirical evaluation presented in this paper spans many miniapplications and real applications executed on modern supercomputers including Blue Gene/Q, Cray XE6, and Stampede.
This paper focuses on the use of GPGPU-based clusters for hierarchical N-body simulations. Whereas the behavior of these hierarchical methods has been studied in the past on CPU-based architectures, we investigate key performance issues in the context of clusters of GPUs. These include kernel organization and efficiency, the balance between tree traversal and force computation work, grain size selection through the tuning of offloaded work request sizes, and the reduction of sequential bottlenecks. The effects of various application parameters are studied and experiments done to quantify gains in performance. Our studies are carried out in the context of a production-quality parallel cosmological simulator called ChaNGa. We highlight the re-engineering of the application to make it more suitable for GPU-based environments. Finally, we present performance results from experiments on the NCSA Lincoln GPU cluster, including a note on GPU use in multistepped simulations.
The emergence of new parallel architectures presents new challenges for application developers. Supercomputers vary in processor speed, network topology, interconnect communication characteristics and memory subsystems. This paper presents a performance comparison of three of the fastest machines in the world: IBM’s Blue Gene/P installation at ANL (Intrepid), the SUN-Infiniband cluster at TACC (Ranger) and Cray’s XT4 installation at ORNL (Jaguar). Comparisons are based on three applications selected by NSF for the Track 1 proposal to benchmark the Blue Waters system: NAMD, MILC and a turbulence code, DNS. We present a comprehensive overview of the architectural details of each of these machines and a comparison of their basic performance parameters. Application performance is presented for multiple problem sizes and the relative performance on the selected machines is explained through micro-benchmarking results. We hope that insights from this work will be useful to managers making buying decisions for supercomputers and application users trying to decide on a machine to run on. Based on the performance analysis techniques used in the paper, we also suggest a step-by-step procedure for estimating the suitability of a given architecture for a highly parallel application.
Fine-grained communication in supercomputing applications often limits performance through high communication overhead and poor utilization of network bandwidth. This paper presents the Topological Routing and Aggregation Module (TRAM), a library that optimizes fine-grained communication performance by routing and dynamically combining short messages. TRAM collects units of fine-grained communication from the application and combines them into aggregated messages with a common intermediate destination. It routes these messages along a virtual mesh topology mapped onto the physical topology of the network. TRAM improves network bandwidth utilization and reduces communication overhead. It is particularly effective in optimizing patterns with global communication and large message counts, such as all-to-all and many-to-many, as well as sparse, irregular, dynamic or data-dependent patterns. We demonstrate how TRAM improves performance through theoretical analysis and experimental verification using benchmarks and scientific applications. We present speedups on petascale systems of 6x for communication benchmarks and up to 4x for applications.
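The aggregation and routing idea described in this abstract can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions and not TRAM's actual API: the class `MeshAggregator`, the helper `all_to_all`, the 2D mesh shape, the fixed per-peer buffer capacity, and the row-then-column routing order are all hypothetical choices made for illustration. Each processing element (PE) buffers items by their next intermediate hop on the virtual mesh and sends one combined message per buffer.

```python
from collections import defaultdict


class MeshAggregator:
    """Toy model of message aggregation over a virtual 2D mesh.

    All names here are hypothetical; this is not the TRAM interface.
    """

    def __init__(self, rows, cols, buffer_size):
        self.rows, self.cols, self.buffer_size = rows, cols, buffer_size
        # buffers[(holder_pe, next_hop_pe)] -> list of (final_dest, payload)
        self.buffers = defaultdict(list)
        self.delivered = defaultdict(list)
        self.messages_sent = 0  # counts aggregated messages, not items

    def coords(self, pe):
        # Map a linear PE index to (row, col) on the virtual mesh.
        return divmod(pe, self.cols)

    def next_hop(self, here, dest):
        r, c = self.coords(here)
        dr, dc = self.coords(dest)
        if c != dc:
            # First move within the current row to the destination column.
            return r * self.cols + dc
        # Column already matches: move within the column to the destination row.
        return dr * self.cols + c

    def submit(self, here, dest, payload):
        if here == dest:
            self.delivered[dest].append(payload)
            return
        hop = self.next_hop(here, dest)
        buf = self.buffers[(here, hop)]
        buf.append((dest, payload))
        if len(buf) >= self.buffer_size:
            self._send(here, hop)

    def _send(self, here, hop):
        items = self.buffers.pop((here, hop), [])
        if not items:
            return
        self.messages_sent += 1  # one combined message carries all buffered items
        for dest, payload in items:
            # The intermediate PE either delivers or re-buffers for the next hop.
            self.submit(hop, dest, payload)

    def flush(self):
        # Drain row-direction buffers first, then column-direction buffers.
        for same_row in (True, False):
            keys = [
                k for k in self.buffers
                if (self.coords(k[0])[0] == self.coords(k[1])[0]) == same_row
            ]
            for here, hop in keys:
                self._send(here, hop)


def all_to_all(rows, cols, buffer_size):
    """Every PE sends one item to every other PE; returns the aggregator."""
    agg = MeshAggregator(rows, cols, buffer_size)
    n = rows * cols
    for src in range(n):
        for dest in range(n):
            if src != dest:
                agg.submit(src, dest, (src, dest))
    agg.flush()
    return agg
```

In this toy setting, `all_to_all(4, 4, 8)` delivers all 240 items using 96 aggregated messages, compared with 240 messages if every item were sent directly. The model ignores message sizes, timing, and the mapping of the virtual mesh onto a physical network, all of which matter in a real implementation.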