Understanding PARSEC performance on contemporary CMPs

Bhadauria, Major; Weaver, Vincent M.; McKee, Sally A.

doi:10.1109/iiswc.2009.5306793

Cited by 59 publications

(34 citation statements)

References 16 publications


“…We observed that this particular benchmark had very poor scaling when run on a large number of cores and that parallel performance gains were only realized when using a very large input. Previous work on evaluating the performance of PARSEC note that for Streamcluster, "95% of compute cycles are spent finding the Euclidean distance between two points" and that "Scaling is sensitive to memory speeds and bus contention" [12]. However, our analysis reveals that the performance issue for Streamcluster does not stem from where compute cycles are being spent, but rather where they are not being spent.…”

Section: Inefficient Barrier Implementation (contrasting)

confidence: 49%

Deconstructing the overhead in parallel applications

Roth, Best, Mustard et al. 2012

2012 IEEE International Symposium on Workload Characterization (IISWC)


Abstract: Performance problems in parallel programs manifest as lack of scalability. These scalability issues are often very difficult to debug. They can stem from synchronization overhead, poor thread scheduling decisions, or contention for hardware resources, such as shared caches. Traditional profiling tools attribute program cycles to different functions, but do not generate immediate insight into issues limiting scalability. Profiling information is very program-specific and is usually processed manually by a human expert in a time-consuming and cumbersome process. Our experience in tuning performance of parallel applications led us to discover that performance tuning can be considerably simplified, and even to some degree automated, if profiling measurements are organized according to several intuitive performance factors common to most parallel programs. In this work we present these factors and propose a hierarchical framework composing them. We present three case studies where analyzing profiling data according to the proposed principle led us to improve performance of three parallel programs by a factor of 6-20×. Our work lays foundation for new ways of organizing and visualizing profiling data in performance tuning tools.


“…These benchmarks have more regular memory access patterns, which gives them relatively good cache locality. Similar trends in behaviour for compute-bound and memory-bound benchmarks on simultaneous multi-threaded (SMT) multicore architectures has been observed for the PARSEC benchmark suite [6].…”

Section: Scalability Study (supporting)

confidence: 48%

Garbage collection auto-tuning for Java mapreduce on multi-cores

Singer, Kovoor, Brown et al. 2011

Proceedings of the International Symposium on Memory Management


MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures. We evaluate its scalability on a four-core, hyperthreaded Intel Core i7 processor, using a set of standard MapReduce benchmarks. We investigate the significant impact that Java runtime garbage collection has on the performance and scalability of MRJ. We propose the use of memory management auto-tuning techniques based on machine learning. With our auto-tuning approach, we are able to achieve MRJ performance within 10% of optimal on 75% of our benchmark tests.


“…(Since we are interested in pushing the number of threads to hundreds, we leave out benchmarks from the kit that either have very limited scalability, or that cannot be spawned with hundreds of threads [7] [14].) We chose the PARSEC kit because it represents emerging workloads, specifically modeling future CMP applications [15].…”

Section: B Workload and System Parameters (mentioning)

confidence: 99%

Threads vs. caches: Modeling the behavior of parallel workloads

Guz, Itzhak, Keidar et al. 2010

2010 IEEE International Conference on Computer Design


A new generation of high-performance engines now combine graphics-oriented parallel processors with a cache architecture. In order to meet this new trend, new highly parallel workloads are being developed. However, it is often difficult to predict how a given application would perform on a given architecture. This paper provides a new model capturing the behavior of such parallel workloads on different multi-core architectures. Specifically, we provide a simple analytical model, which, for a given application, describes its performance and power as a function of the number of threads it runs in parallel, on a range of architectures. We use our model (backed by simulations) to study both synthetic workloads and real ones from the PARSEC suite. Our findings recognize distinctly different behavior patterns for different application families and architectures.
