Prediction-Based Power-Performance Adaptation of Multithreaded Scientific Codes

Curtis-Maury, Matthew; Blagojević, Filip; Antonopoulos, Christos D.; Nikolopoulos, Dimitrios S.

doi:10.1109/tpds.2007.70804

Cited by 83 publications

(50 citation statements)

References 37 publications

(40 reference statements)


“…Many state-of-the-art algorithms for software-controlled dynamic power management [5], [6], [7], [8] use dynamic voltage and frequency scaling (DVFS) to dilate computation into slack (any nonoverlapped hardware or algorithmic latency) that occurs between MPI communication events, thus reducing energy consumption. Alternatively, dynamic concurrency throttling (DCT) [9], [10] controls the number of active threads executing pieces of parallel code, particularly in shared-memory programming models like OpenMP, to save energy and to improve performance simultaneously [11].…”

Section: Introduction (mentioning)

confidence: 99%

Hybrid MPI/OpenMP power-aware computing

Dong

Supinski

Schulz

et al. 2010

2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)

112


Abstract: Power-aware execution of parallel programs is now a primary concern in large-scale HPC environments. Prior research in this area has explored models and algorithms based on dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) to achieve power-aware execution of programs written in a single programming model, typically MPI or OpenMP. However, hybrid programming models combining MPI and OpenMP are growing in popularity as emerging large-scale systems have many nodes with several processors per node and multiple cores per processor. In this paper we present and evaluate solutions for power-efficient execution of programs written in this hybrid model targeting large-scale distributed systems with multicore nodes. We use a new power-aware performance prediction model of hybrid MPI/OpenMP applications to derive a novel algorithm for power-efficient execution of realistic applications from the ASC Sequoia and NPB MZ benchmarks. Our new algorithm yields substantial energy savings (4.18% on average and up to 13.8%) with either negligible performance loss or performance gain (up to 7.2%).
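The idea of using DVFS to dilate computation into slack, mentioned in the citation statement above, can be illustrated with a minimal sketch. It is not the algorithm from the cited paper: the function name `pick_frequency`, the frequency list, and the timings are hypothetical, and the sketch assumes compute time scales inversely with clock frequency, which holds only for fully CPU-bound code.

```python
def pick_frequency(compute_s, slack_s, freqs_ghz):
    """Return the lowest frequency at which the dilated computation
    still finishes within the original compute time plus the slack.

    Assumption (not from the source): compute time scales as f_max / f.
    """
    f_max = max(freqs_ghz)
    budget = compute_s + slack_s
    # Keep every frequency whose stretched compute time fits the budget.
    feasible = [f for f in freqs_ghz if compute_s * f_max / f <= budget]
    # f_max is always feasible when slack_s >= 0, so the list is non-empty.
    return min(feasible)


# Hypothetical values: 4 s of computation followed by 1 s of slack
# before the next communication event.
freqs = [1.0, 1.5, 2.0, 2.5]
chosen = pick_frequency(compute_s=4.0, slack_s=1.0, freqs_ghz=freqs)
print(chosen)
```

With no slack, the same function falls back to the highest frequency, so the computation is never slowed beyond what the slack can absorb.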


“…The number of threads is used to apply Dynamic Concurrency Throttling (DCT) [6,21]. DCT adjusts the number and placement of threads used in each phase of a parallel program running on a shared-memory architecture, to sustain optimal performance while reducing energy consumption.…”

Section: Preliminaries (mentioning)

confidence: 99%
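As a rough illustration of the DCT idea quoted above, the sketch below picks, for one program phase, the smallest thread count whose predicted time stays within a tolerance of the best predicted time. The function name `throttle_concurrency`, the 2% tolerance, and the predicted times are assumptions made for illustration; the quoted work also adjusts thread placement, which this sketch ignores.

```python
def throttle_concurrency(predicted_time_s, tolerance=0.02):
    """Choose the fewest threads that sustain near-optimal performance.

    predicted_time_s maps a thread count to the predicted execution
    time of one phase (hypothetical inputs, e.g. from a prediction model).
    """
    best = min(predicted_time_s.values())
    # Thread counts whose predicted time is within `tolerance` of the best.
    acceptable = [n for n, t in predicted_time_s.items()
                  if t <= best * (1 + tolerance)]
    return min(acceptable)


# Illustrative predictions for a phase that stops scaling beyond 4 threads.
phase_predictions = {1: 10.0, 2: 5.6, 4: 4.0, 8: 4.05}
threads = throttle_concurrency(phase_predictions)
print(threads)
```

Using fewer threads at the same performance level is where the energy saving would come from, since idle cores can drop to a low-power state.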

“…Curtis-Maury, et al [5,6,21] use linear regression models for online power-performance adaptation of multithreaded codes on multi-core architectures. Our work differs from their research in several aspects.…”

Section: Related Work (mentioning)

confidence: 99%
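To make the phrase "linear regression models" concrete, here is a minimal one-variable least-squares sketch. It is not the model of Curtis-Maury et al.: the choice of predictor (IPC sampled at one configuration) and response (IPC at another configuration), the helper `fit_line`, and every number below are made-up illustrative values, not measurements.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic training pairs (assumed, for illustration only):
# IPC sampled at the maximum thread count vs. IPC seen at a target count.
sampled_ipc = [0.5, 1.0, 1.5, 2.0]
target_ipc = [0.45, 0.8, 1.15, 1.5]

slope, intercept = fit_line(sampled_ipc, target_ipc)

# Online use: sample a phase once, then predict the untested configuration.
predicted = slope * 1.2 + intercept
print(round(predicted, 3))
```

A predictor of this kind lets a runtime estimate performance at configurations it has not executed, which is what makes online adaptation cheap enough to apply per phase.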

“…These challenges make analytical models of system performance increasingly hard to construct and ultimately inaccurate, even in architecture-specific, application-specific or input-specific contexts [1][2][3]. These difficulties, along with the increasing diversity in parallel architectures, give rise to statistical, black-box techniques for performance prediction on parallel architectures [4][5][6][7][8][9][10][11].…”

Section: Introduction (mentioning)

confidence: 99%


Scalable black-box prediction models for multi-dimensional adaptation on NUMA multi-cores

Khasymski

Nikolopoulos

2014

International Journal of Parallel, Emergent and Distributed Sys


This paper presents a scalable, statistical "black-box" model for predicting the performance of parallel programs on multi-core NUMA systems. We derive a model with low overhead, by reducing data collection and model training time. The model can accurately predict the behavior of parallel applications in response to changes in their concurrency, thread layout on NUMA nodes, and core voltage and frequency. We present a framework that applies the model to achieve significant energy and energy-delay-square (ED²) savings (9% and 25% respectively) along with performance improvement (10% mean) on an actual 16-core NUMA system running realistic application workloads. Our prediction model proves substantially more accurate than previous efforts.
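The energy-delay-square metric named in the abstract is energy multiplied by the square of execution time, so it penalizes slowdowns more heavily than plain energy does. The sketch below ranks a few configurations by ED²; the configuration labels and the energy and delay figures are hypothetical and are not taken from the paper.

```python
def ed2(energy_j, delay_s):
    """Energy-delay-square: energy (J) times execution time (s) squared."""
    return energy_j * delay_s ** 2


# Hypothetical (energy in joules, delay in seconds) per configuration.
configs = {
    "16 threads @ 2.4 GHz": (1000.0, 10.0),
    "8 threads @ 2.4 GHz": (800.0, 10.5),
    "8 threads @ 1.8 GHz": (700.0, 13.0),
}

# Pick the configuration with the lowest ED2, not the lowest energy.
best = min(configs, key=lambda name: ed2(*configs[name]))
print(best)
```

In this made-up example the lowest-energy configuration is not the winner, because its longer run time is squared in the metric.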


“…The latter are available through concurrency throttling, a technique which adjusts on-the-fly the degree of active concurrency in the program, so that the program uses the minimum number of cores necessary to sustain the highest level of performance possible. We have presented results on a study of the design space of on-line and off-line predictors for dynamic, phase-aware program adaptation in several conference and journal papers [12,11,10,18,16,17]. This research was facilitated through a collaboration between the PI and Lawrence Livermore National Laboratory.…”

Section: The Melisses Continuous Hardware Monitor (mentioning)

confidence: 99%

COMPUTER SCIENCE RESEARCH MELISSES: Liquid Services for Scalable Multithreaded and Multicore Execution on Emerging Supercomputers

Nikolopoulos¹

2008


In the following sections, we summarize the contributions made through support from this DOE ECPI award to research and training in advanced computing systems.

1 Dynamic scheduling of layered parallelism on emerging multi-core processors and many-core clusters

We have developed several schedulers for dynamic multi-grain parallelization on the Cell Broadband Engine. The Cell processor presents a new paradigm for parallel computing on multicore platforms, by combining conventional processor cores with customized accelerators and by offering an explicitly managed memory hierarchy to programmers, for tighter control of locality and performance. Parallel computation on the Cell is accomplished by off-loading compute-intensive and data-intensive code from the conventional cores to the vector SIMD accelerators. Heterogeneous multi-core architectures such as the Cell represent a design point in computer architecture which holds greater promise for sustaining high performance and power-efficiency than conventional, homogeneous multi-core architectures. Cell is also the processor of choice for Roadrunner, a Petaflop-capable supercomputer currently under development by IBM. For these reasons, we believe that the research conducted on Cell with support from the DOE ECPI award is timely, relevant and in line with DOE missions.

The first of the novel schedulers developed in this activity, named MGPS-SLED (for Multi-grain Parallelism Scheduling using Slack Minimizing Event-Driven execution), effectively exploits thread-level and data-level (SIMD) parallelism at runtime, without prior knowledge of the application or input from the programmer. MGPS-SLED follows an event-driven execution model for scheduling tasks and data parallelism of varying granularity on the synergistic processing elements (SPE) of the Cell. MGPS-SLED provides a novel mechanism for deciding between task-level, loop-level and data-level parallelization on the fly, based on runtime workload characterization and observable utilization metrics on the SPEs.

As part of the MGPS-SLED effort, we have ported the MELISSES hardware monitor to the Cell PPE and SPE (the conventional power processing element and the synergistic processing elements of the processor, respectively), to collect continuous data on SPE and PPE utilization and drive the multi-grain decomposition and scheduling processes. More specifically, MELISSES enabled us to collect a historical profile of task execution on the SPE, which, in conjunction with program phase analysis, enabled MGPS-SLED to adaptively select the layers and degrees of parallelism to activate in any phase of the program. We emphasize the major contribution of MGPS-SLED, namely phase-aware optimization of the scheduling process, which would have been impossible without leveraging the MELISSES performance monitoring framework. Phase-aware program control in MELISSES has enabled unprecedented performance and power optimizations in parallel programs. We view this result as one of the major contributions of this effort. MGPS-SLED...

