SUMMARYHPCTOOLKIT is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCTOOLKIT can pinpoint and quantify scalability bottlenecks in fully optimized parallel programs with a measurement overhead of only a few percent. Recently, new capabilities were added to HPCTOOLKIT for collecting call path profiles for fully optimized codes without any compiler support, pinpointing and quantifying bottlenecks in multithreaded programs, exploring performance information and source code using a new user interface, and displaying hierarchical space-time diagrams based on traces of asynchronous call path samples. This paper provides an overview of HPCTOOLKIT and illustrates its utility for performance analysis of parallel applications.
Scalable heterogeneous computing systems, which are composed of a mix of compute devices, such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC's initial focus is on systems containing graphics processing units (GPUs) and multi-core processors, and on the new OpenCL programming standard. SHOC is a spectrum of programs that test the performance and stability of these scalable heterogeneous computing systems. At the lowest level, SHOC uses microbenchmarks to assess architectural features of the system. At higher levels, SHOC uses application kernels to determine system-wide performance including many system features such as intranode and internode communication among devices. SHOC includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.
This paper describes a toolkit for semi-automatically measuring and modeling static and dynamic characteristics of applications in an architecture-neutral fashion. For predictable applications, models of dynamic characteristics have a convex and differentiable profile. Our toolkit operates on application binaries and succeeds in modeling key application characteristics that determine program performance. We use these characterizations to explore the interactions between an application and a target architecture. We apply our toolkit to SPARC binaries to develop architectureneutral models of computation and memory access patterns of the ASCI Sweep3D and the NAS SP, BT and LU benchmarks. From our models, we predict the L1, L2 and TLB cache miss counts as well as the overall execution time of these applications on an Origin 2000 system. We evaluate our predictions by comparing them against measurements collected using hardware performance counters.
Abstract-The potential for improving the performance of data-intensive scientific programs by enhancing data reuse in cache is substantial because CPUs are significantly faster than memory. Traditional performance tools typically collect or simulate cache miss counts or rates and attribute them at the function level. While such information identifies program scopes that exhibit a large cache miss rate, it is often insufficient to diagnose the causes for poor data locality and to identify what program transformations would improve memory hierarchy utilization. This paper describes an approach that uses memory reuse distance to identify an application's most significant memory access patterns causing cache misses and provide insight into ways of improving data reuse. Unlike previous approaches, our tool combines (1) analysis and instrumentation of fully optimized binaries, (2) online analysis of reuse patterns, (3) fine-grain attribution of measurements and models to statements, loops and variables, and (4) static analysis of access patterns to quantify spatial reuse. We demonstrate the effectiveness of our approach for understanding reuse patterns in two scientific codes: one for simulating neutron transport and a second for simulating turbulent transport in burning plasmas. Our tools pinpointed opportunities for enhancing data reuse. Using this feedback as a guide, we transformed the codes, reducing their misses at various levels of the memory hierarchy by integer factors and reducing their execution time by as much as 60% and 33%, respectively.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.