Modeling application performance by convolving machine signatures with application profiles

Snavely, Allan; Wolter, Nicole; Carrington, Laura

doi:10.1109/wwc.2001.990754

Cited by 55 publications

(57 citation statements)

References 20 publications


“…This can be achieved by building a performance model that predicts the effectiveness of communication-reduction techniques under given platform properties and application characteristics. Such performance models have been constructed for other high-performance computing applications in the past, both on the application computation performance [30,31] and on the message passing performance [32,33]. However, it is challenging to build accurate performance models for irregular applications such as the parallel sparse LU factorization because their data structures and execution behaviors are hard to predict.…”

Section: Runtime Application Adaptation (mentioning)

confidence: 99%

Parallel sparse LU factorization on different message passing platforms

Shen

2006

Journal of Parallel and Distributed Computing


Several message passing-based parallel solvers have been developed for general (nonsymmetric) sparse LU factorization with partial pivoting. Existing solvers were mostly deployed and evaluated on parallel computing platforms with high message passing performance (e.g., 1-10 µs in message latency and 100-1000 Mbytes/sec in message throughput) while little attention has been paid to slower platforms. This paper investigates techniques that are specifically beneficial for LU factorization on platforms with slow message passing. In the context of the S+ distributed memory solver, we find that significant reduction in the application message passing overhead can be attained at the cost of extra computation and slightly weakened numerical stability. In particular, we propose batch pivoting to make pivot selections in groups through speculative factorization, and thus substantially decrease the interprocessor synchronization granularity. We experimented on three different message passing platforms with different communication speeds. While the proposed techniques provide no performance benefit and even slightly weaken numerical stability on an IBM Regatta multiprocessor with fast message passing, they improve the performance of our test matrices by 15-460% on an Ethernet-connected 16-node PC cluster. Given the different tradeoffs of communication-reduction techniques on different message passing platforms, we also propose a sampling-based runtime application adaptation approach that automatically determines whether these techniques should be employed for a given platform and input matrix.


“…Snavely et al. use profile convolving [1], a trace-based method that involves the creation of a machine profile and an application profile. Machine profiles describe the behavior of loads and stores for the given processor, while the application profile is a runtime utility which captures and statistically records all memory references.…”

Section: Previous Work (mentioning)

confidence: 99%

Modelling the Performance of the Gaussian Chemistry Code on x86 Architectures

Antony

Frisch

Rendell

2008

Modeling, Simulation and Optimization of Complex Processes


Summary. Gaussian is a widely used scientific code with application areas in chemistry, biochemistry and material sciences. To operate efficiently on modern architectures Gaussian employs cache blocking in the generation and processing of the two-electron integrals that are used by many of its electronic structure methods. This study uses hardware performance counters to characterise the cache and memory behavior of the integral generation code used by Gaussian in Hartree-Fock calculations. A simple performance model is proposed that aims to predict overall performance as a function of total instruction and cache miss counts. The model is parameterised for three different x86 processors: the Intel Pentium M, the P4 and the AMD Opteron. Results suggest that the model is capable of predicting execution times to an accuracy of between 5 and 15%. Use of this model in developing a dynamic cache blocking scheme is also discussed.
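
The abstract above does not give the functional form of its model, so the following is only a minimal sketch of one plausible reading: execution time approximated as a linear combination of instruction count and cache miss count, with the two per-event costs fitted by least squares. The linear form, the function names `fit_two_term_model` and `predict`, and the sample numbers are all assumptions made for illustration and are not taken from the cited paper.

```python
def fit_two_term_model(samples):
    """Least-squares fit of time ~ a*instructions + b*misses (no intercept).

    samples: iterable of (instructions, misses, measured_time) tuples.
    Returns (a, b), the fitted cost per instruction and per cache miss.
    The linear form is an assumption, not the cited paper's stated model.
    """
    samples = list(samples)
    # Sums that make up the 2x2 normal equations.
    sii = sum(i * i for i, m, t in samples)
    smm = sum(m * m for i, m, t in samples)
    sim = sum(i * m for i, m, t in samples)
    sit = sum(i * t for i, m, t in samples)
    smt = sum(m * t for i, m, t in samples)
    det = sii * smm - sim * sim
    if det == 0:
        raise ValueError("samples are collinear; the two costs cannot be separated")
    a = (sit * smm - smt * sim) / det
    b = (smt * sii - sit * sim) / det
    return a, b


def predict(a, b, instructions, misses):
    """Predicted time for a run with the given event counts."""
    return a * instructions + b * misses


# Made-up calibration runs: (instructions, misses, time) in arbitrary units.
calibration = [(1.0, 0.0, 2.0), (0.0, 1.0, 10.0), (1.0, 1.0, 12.0)]
a, b = fit_two_term_model(calibration)
print(predict(a, b, instructions=3.0, misses=0.5))
```

In practice the counts would come from hardware performance counters, and the fit would be repeated per processor, since the abstract reports a separate parameterisation for each of the three x86 chips.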


“…We evaluate the performance model in which we use true hardware counters through PAPI [2] to predict the performance (henceforth called the PAPI model) and compare it to the model in which we use estimates of lower and upper bound of cache and TLB misses (henceforth termed the analytic lower and upper bound models). The cache and memory latencies were derived [15] from published processor manuals, curve fitting, and experimental work using the Saavedra-Barrera memory system microbenchmark [10] and MAPS benchmarks [11]. Due to space limitations we present a summary of the full data [8].…”

Section: Verification Of the Analytic Model (mentioning)

confidence: 99%

When cache blocking of sparse matrix vector multiply works and why

Nishtala

Vuduc

Demmel

et al.

2007

AAECC


We present new performance models and more compact data structures for cache blocking when applied to sparse matrix-vector multiply (SpM×V). We extend our prior models by relaxing the assumption that the vectors fit in cache and find that the new models are accurate enough to predict optimum block sizes. In addition, we determine criteria that predict when cache blocking improves performance. We conclude with architectural suggestions that would make memory systems execute SpM×V faster.
