Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications 2007
DOI: 10.1145/1297027.1297033
Statistically rigorous Java performance evaluation

Abstract: Java performance is far from trivial to benchmark because it is affected by various factors such as the Java application, its input, the virtual machine, the garbage collector, the heap size, etc. In addition, non-determinism at run-time causes the execution time of a Java program to differ from run to run. There are a number of sources of non-determinism, such as Just-In-Time (JIT) compilation and optimization in the virtual machine (VM) driven by timer-based method sampling, thread scheduling, garbage collection, …
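The paper's core prescription is to treat each fresh VM invocation as a single sample and to report the mean execution time with a confidence interval, rather than a single best or average run. Below is a minimal sketch of that calculation in Java; the per-invocation timings are hypothetical placeholders, and the z-value of 1.96 assumes enough samples (roughly 30 or more) for the normal approximation to hold, with a Student t critical value appropriate for fewer samples.

```java
import java.util.Arrays;

// Sketch: mean and 95% confidence interval over multiple VM invocations.
public class ConfidenceInterval {
    public static void main(String[] args) {
        // Hypothetical execution times (ms), one per fresh VM invocation.
        double[] times = { 812.4, 798.1, 805.9, 821.7, 809.3,
                           801.2, 815.0, 799.8, 807.6, 811.1 };

        int n = times.length;
        double mean = Arrays.stream(times).average().getAsDouble();

        // Unbiased sample variance (divide by n - 1).
        double variance = Arrays.stream(times)
                                .map(t -> (t - mean) * (t - mean))
                                .sum() / (n - 1);
        double stdErr = Math.sqrt(variance / n);

        // 1.96 is the 97.5th percentile of the standard normal; with only
        // 10 samples a Student t value (about 2.26 for 9 df) would be wider.
        double half = 1.96 * stdErr;
        System.out.printf("mean = %.1f ms, 95%% CI = [%.1f, %.1f]%n",
                          mean, mean - half, mean + half);
    }
}
```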

Cited by 378 publications (241 citation statements) · References 19 publications

Citing excerpts (ordered by relevance):
“…Therefore graph topology, and in particular the splitting factor at data-parallel split-join nodes, can have an impact on performance. The experimental results were performed according to the methodology suggested by Georges et al [11].…”
Section: Results (mentioning)
confidence: 99%
“…We have implemented these four default policies in Panini Capsules and our comparison uses the same Panini program. We measure program runtime and CPU consumption for thread, round-robin, random, work-stealing and our technique when the steady-state performance is reached by following the methodology of Georges et al [13]. We compare program runtime and CPU consumption for these five policies on 2, 4, 8, and 12 cores settings (Linux taskset utility is used for altering core settings on 12-core system).…”
Section: Methods (mentioning)
confidence: 99%
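The steady-state criterion referenced in this excerpt is commonly implemented as a coefficient-of-variation check: keep iterating within one VM invocation until the last k iteration times are stable, then measure over that window. The sketch below illustrates one such reading; the window size k = 10, the 2% threshold, and the workload are illustrative assumptions, not values taken from the cited papers.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: treat steady state as "coefficient of variation (CoV) of the
// last K iteration times below a threshold", then measure over that window.
public class SteadyState {
    static final int K = 10;               // assumed window size
    static final double THRESHOLD = 0.02;  // assumed CoV bound (2%)

    static boolean isSteady(Deque<Double> window) {
        if (window.size() < K) return false;
        double mean = window.stream().mapToDouble(Double::doubleValue)
                            .average().getAsDouble();
        double variance = window.stream()
                                .mapToDouble(t -> (t - mean) * (t - mean))
                                .sum() / (K - 1);
        return Math.sqrt(variance) / mean < THRESHOLD;
    }

    public static void main(String[] args) {
        Deque<Double> window = new ArrayDeque<>();
        do {
            double ms = timedIteration();
            if (window.size() == K) window.removeFirst();
            window.addLast(ms);
        } while (!isSteady(window));
        // Steady state reached: report statistics over the last K timings.
        System.out.println("steady-state window (ms): " + window);
    }

    // Placeholder standing in for one timed iteration of the real benchmark.
    static double timedIteration() {
        long start = System.nanoTime();
        double acc = 0;
        for (int i = 1; i <= 1_000_000; i++) acc += Math.sqrt(i);
        if (acc < 0) System.out.println(acc); // keep acc live (never true)
        return (System.nanoTime() - start) / 1e6;
    }
}
```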
“…Within each trial, every benchmark (e.g., etcd-1) consists of 50 repeated executions (e.g., using the -i50 parameter of JMH) and every execution produces a single data point, which reports the average execution time in ns. For JMH benchmarks, we also run 10 warmup executions (after which steady-state performance is most likely reached [9]) prior to the test executions. The performance counters originating from warmup iterations are discarded.…”
Section: Package (mentioning)
confidence: 99%
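For reference, the JMH configuration quoted above (10 discarded warmup executions, then 50 measured executions reporting average time in nanoseconds) maps directly onto standard JMH annotations; the class name, fork count, and loop body below are hypothetical stand-ins for the real workload.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Hypothetical JMH benchmark mirroring the configuration quoted above:
// 10 warmup iterations (discarded), 50 measurement iterations (-i50),
// reporting average execution time in nanoseconds.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 10)
@Measurement(iterations = 50)
@Fork(1) // assumed; the excerpt does not state a fork count
@State(Scope.Benchmark)
public class ExampleBenchmark {
    @Benchmark
    public long placeholderWorkload() {
        // Stand-in for the real benchmark body; returning the result
        // prevents JMH from treating it as dead code.
        long acc = 0;
        for (int i = 0; i < 10_000; i++) acc += i;
        return acc;
    }
}
```

The same configuration can also be supplied on the JMH command line with `-wi 10 -i 50` instead of annotations.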
“…Others report on wrongly quantified experimental evaluations by ignoring uncertainty of measurements through nondeterministic behavior of software systems (e.g., memory placement, dynamic compilation) [15]. Dealing with non-deterministic behavior of dynamically optimized programming languages, Georges et al [9] summarize methodologies to measure languages like Java, and explain statistical methods to use for performance evaluation. All of these studies expect an as stable as possible environment to run performance experiments on.…”
Section: Related Work (mentioning)
confidence: 99%