SUMMARY: HPCTOOLKIT is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCTOOLKIT can pinpoint and quantify scalability bottlenecks in fully optimized parallel programs with a measurement overhead of only a few percent. Recently, new capabilities were added to HPCTOOLKIT for collecting call path profiles for fully optimized codes without any compiler support, pinpointing and quantifying bottlenecks in multithreaded programs, exploring performance information and source code using a new user interface, and displaying hierarchical space-time diagrams based on traces of asynchronous call path samples. This paper provides an overview of HPCTOOLKIT and illustrates its utility for performance analysis of parallel applications.
In 1998, Numrich and Reid proposed Coarray Fortran as a simple set of extensions to Fortran 95 [7]. Their principal extension to Fortran was support for shared data known as coarrays. In 2005, the Fortran Standards Committee began exploring the addition of coarrays to Fortran 2008, which is now being finalized. Careful review of drafts of the emerging Fortran 2008 standard led us to identify several shortcomings of the proposed coarray extensions. In this paper, we briefly critique the coarray extensions proposed for Fortran 2008, outline a new vision for coarrays in the Fortran language that is far more expressive, and briefly describe our strategy for implementing the language extensions that we propose.
Abstract: Applications must scale well to make efficient use of today's class of petascale computers, which contain hundreds of thousands of processor cores. Inefficiencies that do not even appear in modest-scale executions can become major bottlenecks in large-scale executions. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of scaling problems. Load imbalance is one of the most common scaling problems. To provide actionable insight into load imbalance, we present post-mortem parallel analysis techniques for pinpointing and quantifying load imbalance in the context of call path profiles of parallel programs. We show how to identify load imbalance in its static and dynamic context by using only low-overhead asynchronous call path profiling to locate regions of code responsible for communication wait time in SPMD executions. We describe the implementation of these techniques within HPCTOOLKIT.
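The abstract above describes pinpointing load imbalance from per-process call path profiles. As a rough illustration of the general idea, the sketch below ranks call paths by a simplified, hypothetical metric (excess wait time of the slowest rank over the mean). The data, call-path labels, and metric are invented for illustration; this is not HPCToolkit's actual analysis.

```python
from statistics import mean

# Hypothetical wait times (seconds) per MPI rank, attributed to two
# call-path nodes in an SPMD execution (labels are invented).
wait_by_call_path = {
    "solve > exchange_halo > MPI_Wait":        [0.50, 0.52, 0.49, 0.51],
    "solve > reduce_residual > MPI_Allreduce": [0.10, 0.12, 0.11, 2.40],
}

def excess_wait(times):
    """How much the slowest rank waits beyond the average rank.

    A large value suggests the ranks reached this call path at very
    different times, i.e., the preceding computation was imbalanced."""
    return max(times) - mean(times)

# Rank call paths by imbalance so the worst offender surfaces first.
ranked = sorted(wait_by_call_path.items(),
                key=lambda kv: excess_wait(kv[1]), reverse=True)
for path, times in ranked:
    print(f"{excess_wait(times):6.3f} s  {path}")
```

Attributing the metric to call-path nodes (rather than flat function totals) is what makes the feedback actionable: it points at the specific call site, in context, whose preceding work should be rebalanced.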
Cutting-edge science and engineering applications require petascale computing. It is, however, a significant challenge to use petascale computing platforms effectively. Consequently, there is a critical need for performance tools that enable scientists to understand impediments to performance on emerging petascale systems. In this paper, we describe HPCToolkit, a suite of multi-platform tools that supports sampling-based analysis of application performance on emerging petascale platforms. HPCToolkit uses sampling to pinpoint and quantify both scaling and node performance bottlenecks. We study several emerging petascale applications on the Cray XT and IBM BlueGene/P platforms and use HPCToolkit to identify specific source lines, in their full calling context, associated with performance bottlenecks in these codes. Such information is exactly what application developers need to know to improve their applications to take full advantage of the power of petascale systems.
Applications must scale well to make efficient use of even medium-scale parallel systems. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of performance bottlenecks. Although tracing is a powerful performance-analysis technique, tools that employ it can quickly become bottlenecks themselves. Moreover, to obtain actionable performance feedback for modular parallel software systems, it is often necessary to collect and present fine-grained context-sensitive data, the very thing scalable tools avoid. While existing tracing tools can collect calling contexts, they do so only in a coarse-grained fashion; and no prior tool scalably presents both context- and time-sensitive data. This paper describes how to collect, analyze, and present fine-grained call path traces for parallel programs. To scale our measurements, we use asynchronous sampling, whose granularity is controlled by a sampling frequency, and a compact representation. To present traces at multiple levels of abstraction and at arbitrary resolutions, we use sampling to render complementary slices of calling-context-sensitive trace data. Because our techniques are general, they can be used on applications that use different parallel programming models (MPI, OpenMP, PGAS). This work is implemented in HPCToolkit.
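The abstract above relies on asynchronous sampling to keep call path collection cheap. As a rough illustration of the mechanism only (not HPCToolkit's implementation, which unwinds the native stacks of fully optimized binaries), the Python sketch below uses a CPU-time profiling timer to interrupt a running workload and record its calling context at each sample; all function names here are our own.

```python
import signal
import time

def capture_call_path(frame):
    """Walk the interpreter stack to record the current calling context."""
    path = []
    while frame is not None:
        path.append(frame.f_code.co_name)
        frame = frame.f_back
    return tuple(reversed(path))  # outermost caller first

samples = []  # (timestamp, call path): a compact, context-sensitive trace

def on_sample(signum, frame):
    # The interrupted frame is the workload's current stack top.
    samples.append((time.time(), capture_call_path(frame)))

def profile(workload, interval=0.005):
    """Run workload while taking asynchronous call path samples.

    The sampling frequency (1/interval) directly controls both the
    measurement granularity and the overhead. Unix-only: SIGPROF fires
    as CPU time, not wall-clock time, elapses."""
    signal.signal(signal.SIGPROF, on_sample)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    workload()
    signal.setitimer(signal.ITIMER_PROF, 0)  # stop the timer
    return samples

def inner():
    s = 0
    for i in range(200_000):
        s += i * i  # CPU-bound work so samples land here
    return s

def outer():
    for _ in range(50):
        inner()

trace = profile(outer)
```

Because each sample already carries a timestamp and a full calling context, the resulting trace supports exactly the kind of context- and time-sensitive views the paper describes, at a cost governed by the sampling rate rather than by the number of function calls.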
Abstract. As part of the U.S. Department of Energy's Scientific Discovery through Advanced Computing (SciDAC) program, science teams are tackling problems that require simulation and modeling on petascale computers. As part of activities associated with the SciDAC Center for Scalable Application Development Software (CScADS) and the Performance Engineering Research Institute (PERI), Rice University is building software tools for performance analysis of scientific applications on leadership-class platforms. In this poster abstract, we briefly describe the HPCToolkit performance tools and how they can be used to pinpoint bottlenecks in SPMD and multi-threaded parallel codes. We demonstrate HPCToolkit's utility by applying it to two SciDAC applications: the S3D code for simulation of turbulent combustion and the MFDn code for ab initio calculations of the microscopic structure of nuclei.

Introduction

Harnessing the power of emerging petascale computational platforms to aid scientific discovery poses a grand challenge for computer science. Programming petascale systems requires both exploiting coarse-grained parallelism across nodes and making effective use of individual microprocessor-based nodes. Performance tools that help pinpoint bottlenecks and quantify opportunities for improvement are essential for tuning codes to use these platforms efficiently. As part of activities associated with the SciDAC Center for Scalable Application Development Software (CScADS) and the Performance Engineering Research Institute (PERI), Rice University is developing HPCToolkit [1], a performance toolkit for accurately measuring and pinpointing performance bottlenecks. HPCToolkit uses novel techniques for measurement and analysis of parallel programs. In particular, it uses statistical sampling of hardware performance counters and attributes metrics to both the calling context in which they occur and program structure, including loops and inlined procedures.
We describe here how we use HPCToolkit to pinpoint scalability and multicore performance bottlenecks. We demonstrate HPCToolkit's utility by applying it to two SciDAC applications: the S3D code for simulation of turbulent combustion [2] and the MFDn code for ab initio calculations of the microscopic structure of nuclei [3]. Although we focus here on measurement and analysis to support manual tuning, the precise characterization of performance bottlenecks that we obtain is a prerequisite for more automatic approaches based on feedback-directed optimization.