With the growing complexity of supercomputing applications and systems, it is important to continually develop existing performance measurement and analysis tools to provide new insights into application performance characteristics and thereby help scientists and engineers utilize computing resources more efficiently. We present several new techniques developed, implemented, and integrated into the Scalasca toolset specifically to enhance the performance analysis of long-running applications. The first is a hybrid measurement system that seamlessly integrates sampled and event-based measurements, delivering low-overhead yet highly detailed measurements, and is therefore particularly convenient for initial performance analyses. We then apply iteration profiling to scientific codes and present an algorithm that reduces the memory and storage requirements of the collected data via iteration profile clustering. Finally, we evaluate the complete integration of all these techniques in a unified measurement system.
I. INTRODUCTION

Supercomputers play a key role in countless areas of science and engineering, enabling the development of new insights and technological advances that were previously inconceivable. The strategic importance and ever-growing complexity of using supercomputing resources efficiently make parallel performance analysis tools invaluable for the scientific and engineering community. The Scalasca toolset [1] is a highly scalable, open-source profiling and tracing tool supporting measurements of MPI, OpenMP, and hybrid MPI/OpenMP applications that has been demonstrated to scale effectively to 294,912 processes [2]. In the course of this thesis project, several improvements to the Scalasca toolset were developed, implemented, and evaluated to extend its applicability to an even wider range of use cases and to provide advanced features that give more insight into the complex performance phenomena encountered in long-running, large-scale applications. Table I shows the set of representative scientific codes studied, consisting of the SPEC MPI 2007 suite of large applications complemented by the local DROPS and PEPC applications. (PEPC was run with 1,024 processes on the Jugene Blue Gene/P, and the others with 256 processes on the Juropa Nehalem cluster.) These applications are written in a variety of languages with varying complexity, particularly in their use of MPI, and run at a range of scales on different HPC systems at Jülich Supercomputing Centre. Some perform thousands of iterations (or time steps), others only hundreds, and in a couple of cases no clear iteration loop was identifiable (such as the 122.tachyon ray-tracing graphics application).