HPCToolkit: tools for performance analysis of optimized parallel programs

Adhianto, Laksono; Banerjee, Subarno; Fagan, Michael; Krentel, Mark W.; Marin, Gabriel; Mellor-Crummey, John; Tallent, Nathan R.

doi:10.1002/cpe.1553

Cited by 416 publications

(299 citation statements)

References 40 publications

(38 reference statements)

Supporting

Mentioning

290

Contrasting

Unclassified


“…Burtscher et al [10] designed Perfexpert to automate identifying the performance bottlenecks of HPC applications with predefined rules. Adhianto et al [3] designed HPCToolkit to measure hardware events and to correlate the events with source code to identify performance bottlenecks of parallel applications. The detection mechanisms of these tools were heavily dependent on manually created metrics and rules.…”

Section: Related Work (mentioning)

confidence: 99%

“…1.5(a, c, e) show the average execution time of checkpoints 25, 31 and 36 for the number of saved objects, and the linear relation between the average execution time and the number of saved objects. It shows the performance bottleneck in these checkpoints when computed with the large number of stored objects. This is because the large number of saved objects requires more comparisons and computation.…”

Section: Analysis Of Saved Objects (mentioning)

confidence: 99%


Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters

Yoo, Koo, Cao, et al. 2016

Conquering Big Data With High Performance Computing


Big data is prevalent in HPC computing. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often require running over thousands of CPU cores and performing simultaneous data accesses, data movements, and computation. It is challenging to analyze the performance involving terabytes or petabytes of workflow data or measurement data of the executions, from complex workflows over a large number of nodes and multiple parallel task executions. To help identify performance bottlenecks or debug the performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework, using state-of-the-art open-source big data processing tools. Our tool can ingest system logs and application performance measurements to extract key performance features, and apply the most sophisticated statistical tools and data mining methods on the performance data. It utilizes an efficient data processing engine to allow users to interactively analyze a large amount of different types of logs and measurements. To illustrate the functionality of the big data analysis framework, we conduct case studies on the workflows from an astronomy project known as the Palomar Transient Factory (PTF) and the job logs from the genome analysis scientific cluster. Our study processed many terabytes of system logs and application performance measurements collected on the HPC systems at NERSC. The implementation of our tool is generic enough to be used for analyzing the performance of other HPC systems and Big Data workflows.


“…Based on this assessment, the compiler skips the instrumentation of those functions that are either short or called within nested loops. However, generally not instrumenting small functions was criticized by Adhianto et al [1]. They argue that small functions often play a significant role, for example, if they include synchronization calls important to parallel performance.…”

Section: Related Work (mentioning)

confidence: 99%

Reducing the Overhead of Direct Application Instrumentation Using Prior Static Analysis

Mußler, Lorenz, Wolf 2011

Euro-Par 2011 Parallel Processing


Abstract. Preparing performance measurements of HPC applications is usually a tradeoff between accuracy and granularity of the measured data. When using direct instrumentation, that is, the insertion of extra code around performance-relevant functions, the measurement overhead increases with the rate at which these functions are visited. If applied indiscriminately, the measurement dilation can even be prohibitive. In this paper, we show how static code analysis in combination with binary rewriting can help eliminate unnecessary instrumentation points based on configurable filter rules. In contrast to earlier approaches, our technique does not rely on dynamic information, making extra runs prior to the actual measurement dispensable. Moreover, the rules can be applied and modified without re-compilation. We evaluate filter rules designed for the analysis of computation and communication performance and show that in most cases the measurement dilation can be reduced to a few percent while still retaining significant detail.
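The filter-rule idea in this abstract, together with the objection attributed to Adhianto et al. in the citation statement above, can be illustrated with a small sketch. It is not the authors' rule set: the `FunctionInfo` fields, the thresholds, and the function names are all hypothetical and chosen only to show how a static rule might skip short functions and functions called from nested loops while still keeping small functions that contain synchronization calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionInfo:
    """Static facts about one function (all fields are hypothetical)."""
    name: str
    instruction_count: int    # size of the function body
    max_call_loop_depth: int  # deepest loop nest from which it is called
    has_sync_call: bool       # contains a synchronization or communication call


def should_instrument(fn, min_instructions=50, max_loop_depth=1):
    """Decide from static information alone whether to instrument fn.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    # Keep functions with synchronization calls even when they are small,
    # since they may matter for parallel performance.
    if fn.has_sync_call:
        return True
    # Skip short functions: the probe cost would dominate their runtime.
    if fn.instruction_count < min_instructions:
        return False
    # Skip functions called from nested loops: they are visited too often.
    if fn.max_call_loop_depth > max_loop_depth:
        return False
    return True


def select_functions(functions, **thresholds):
    """Return the names of the functions that pass the filter rule."""
    return [fn.name for fn in functions if should_instrument(fn, **thresholds)]


functions = [
    FunctionInfo("solve", 400, 0, False),
    FunctionInfo("get_index", 8, 3, False),
    FunctionInfo("halo_wait", 12, 2, True),
]
selected = select_functions(functions)
```

Because the rule reads only static attributes, changing a threshold means re-running the selection, not re-profiling the application, which is the property the abstract emphasizes.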


“…Furthermore, some higher level analysis tools gather additional information by combining the HPM counts with application level traces. Popular representatives of that analysis method are HPCToolkit [1], PerfSuite [10], Open|Speedshop [16] or Scalasca [3]. The intention of these tools is to advise the application developer with educated optimization hints.…”

Section: Introduction and Related Work (mentioning)

confidence: 99%

Validation of Hardware Events for Successful Performance Pattern Identification in High Performance Computing

Röhl, Eitzinger, Hager, et al. 2016

Tools for High Performance Computing 2015


Hardware performance monitoring (HPM) is a crucial ingredient of performance analysis tools. While there are interfaces like LIKWID, PAPI or the kernel interface perf_event which provide HPM access with some additional features, many higher level tools combine event counts with results retrieved from other sources like function call traces to derive (semi-)automatic performance advice. However, although HPM is available for x86 systems since the early 90s, only a small subset of the HPM features is used in practice. Performance patterns provide a more comprehensive approach, enabling the identification of various performance-limiting effects. Patterns address issues like bandwidth saturation, load imbalance, non-local data access in ccNUMA systems, or false sharing of cache lines. This work defines HPM event sets that are best suited to identify a selection of performance patterns on the Intel Haswell processor. We validate the chosen event sets for accuracy in order to arrive at a reliable pattern detection mechanism and point out shortcomings that cannot be easily circumvented due to bugs or limitations in the hardware.

