Proceedings of the 2013 International Symposium on Memory Management
DOI: 10.1145/2491894.2464160

Rigorous benchmarking in reasonable time

Abstract: Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition costs time to complete experiments. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods […]

Cited by 36 publications (24 citation statements) | References 21 publications
“…Once machine code generation has completed, the VM is said to have finished warming up, and the program is said to be executing at a steady state of peak performance. While the length of the warmup period is dependent on the program and JIT compiler, all JIT compiling VMs are based on this performance model [Kalibera and Jones 2013].…”
Section: Discussion
confidence: 99%
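
To make the quoted warmup model concrete, here is a minimal timing-harness sketch in Python. The workload in run_iteration is a hypothetical stand-in, and CPython has no JIT compiler, so the point is the shape of the harness rather than its output on this interpreter.

```python
import time

def run_iteration():
    # Hypothetical stand-in for one in-process benchmark iteration.
    return sum(i * i for i in range(100_000))

# Time repeated in-process iterations. Under the quoted model, early
# iterations run slowly while the JIT compiles hot code; later ones
# settle at a steady state of peak performance.
times = []
for _ in range(50):
    start = time.perf_counter()
    run_iteration()
    times.append(time.perf_counter() - start)

# Arbitrary 10-iteration cut-off, purely for illustration; Kalibera and
# Jones argue the warmup length must be established per benchmark.
warmup, steady = times[:10], times[10:]
print(f"mean warmup iteration: {sum(warmup) / len(warmup):.6f}s")
print(f"mean steady iteration: {sum(steady) / len(steady):.6f}s")
```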
“…Georges et al [2007]. Kalibera and Jones [2013] convincingly show the limitations of such approaches, presenting instead a manual approach to determining if and when a steady state has been reached. While this is a significant improvement on previous methods, it is time-consuming, prone to human inconsistency, and gives no indication as to whether the steady state represents peak performance or not.…”
Section: Overview Of the Methodologymentioning
confidence: 99%
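
For contrast with the manual approach the quotation describes, the sketch below shows the kind of automated steady-state rule (a sliding-window coefficient-of-variation test in the spirit of Georges et al.) whose limitations Kalibera and Jones demonstrate. The window size and threshold are arbitrary assumptions, which is precisely the weakness of such heuristics.

```python
import statistics

def steady_state_start(times, window=5, cov_threshold=0.02):
    """Index of the first sliding window whose coefficient of variation
    drops below the threshold, or None if the run never stabilises."""
    for i in range(len(times) - window + 1):
        w = times[i:i + window]
        if statistics.stdev(w) / statistics.mean(w) < cov_threshold:
            return i
    return None

# Example: a run that stabilises after four noisy warmup iterations.
print(steady_state_start([3.1, 2.4, 1.9, 1.3, 1.01, 1.00, 1.02, 1.01, 0.99]))
```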
“…Note that the slightly higher geomean gains on the AMD system, as compared to the Intel system, are due to the underlying differences in the architecture, and such improved gains have been observed across all the kernels (both high-sync-op and low-sync-op). Since the speedups in Figure are small (close to 1×), we also report the confidence intervals for the speedup ratio (as defined by Kalibera and Jones) for the 16-core Intel system and 64-core AMD system in Figure . The narrow width of the confidence intervals shows that the execution time is fairly stable across different runs.…”
Section: Implementation and Evaluation
confidence: 99%
“…This is because the latter includes a series of parallel-for-loops leading to significant task creation and termination overheads, which is avoided in the former because of the use of clocks. Since the speedups in Figure are small (close to 1×), we also report the confidence intervals for the speedup ratio (as defined by Kalibera and Jones) of the async-finish kernel versions compared to the baseline and uClocks versions, for two of the highest configurations (the 16-core Intel system and 64-core AMD system in Figure ).…”
Section: Implementation and Evaluation
confidence: 99%
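
Both statements above rest on a confidence interval for a speedup ratio. As a rough illustration only, this sketch computes a percentile-bootstrap interval for a ratio of means; Kalibera and Jones's actual construction accounts for nested levels of repetition (binary builds, VM invocations, in-process iterations), which this flat, single-level resample deliberately omits.

```python
import random
import statistics

def bootstrap_speedup_ci(baseline, optimized, reps=10_000, alpha=0.05):
    """Percentile-bootstrap CI for mean(baseline) / mean(optimized).
    A simplified stand-in, NOT the Kalibera-Jones interval."""
    ratios = sorted(
        statistics.mean(random.choices(baseline, k=len(baseline)))
        / statistics.mean(random.choices(optimized, k=len(optimized)))
        for _ in range(reps)
    )
    return ratios[int(reps * alpha / 2)], ratios[int(reps * (1 - alpha / 2))]

# Illustrative timings (seconds); a speedup interval that straddles 1x
# cannot be distinguished from noise.
lo, hi = bootstrap_speedup_ci([2.03, 1.98, 2.05, 2.01],
                              [1.96, 2.00, 1.94, 1.97])
print(f"95% CI for speedup: [{lo:.3f}, {hi:.3f}]")
```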
“…Performance measurements may lead to incorrect results if not handled carefully [1]. Thus, a statistically rigorous performance evaluation is required [16, 23, 28]. To mitigate instability and incorrect results, we differentiate VM start-up and steady-state.…”
Section: Corpus
confidence: 99%
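
The start-up versus steady-state distinction drawn in this last statement can be sketched as two separate measurement loops: fresh VM invocations for start-up, many in-process iterations within one invocation for steady state. Every command name and flag below is hypothetical.

```python
import subprocess
import time

# Hypothetical benchmark jar and flags, invented for illustration.
JVM = ["java", "-jar", "bench.jar"]

# Start-up: time several *fresh* VM invocations end to end, so class
# loading and JIT compilation are paid inside every sample.
startup = []
for _ in range(10):
    t0 = time.perf_counter()
    subprocess.run(JVM + ["--iterations", "1"], check=True,
                   capture_output=True)
    startup.append(time.perf_counter() - t0)
print(f"mean start-up: {sum(startup) / len(startup):.3f}s over 10 invocations")

# Steady state: one invocation, many in-process iterations, with the
# benchmark itself reporting per-iteration times after warmup.
subprocess.run(JVM + ["--iterations", "50", "--report-after-warmup"],
               check=True)
```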