2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009)
DOI: 10.1109/ispass.2009.4919641

Machine learning based online performance prediction for runtime parallelization and task scheduling

Abstract: With the emerging many-core paradigm, parallel programming must extend beyond its traditional realm of scientific applications. Converting existing sequential applications as well as developing next-generation software requires assistance from hardware, compilers and runtime systems to exploit parallelism transparently within applications. These systems must decompose applications into tasks that can be executed in parallel and then schedule those tasks to minimize load imbalance. However, many system…
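For illustration only: the abstract's central step (decompose the application into tasks, then schedule them to minimize load imbalance) can be sketched with a generic greedy heuristic. This is not the paper's algorithm; the task names and the predict_runtime hook below are hypothetical.

```python
# Illustrative sketch (not the paper's scheduler): greedily place tasks onto
# cores by predicted runtime to reduce load imbalance.
import heapq

def schedule(tasks, predict_runtime, num_cores):
    """Assign each task to the currently least-loaded core.

    tasks           -- iterable of opaque task descriptors
    predict_runtime -- callable mapping a task to an estimated runtime
    num_cores       -- number of homogeneous cores to fill
    """
    # Longest-predicted-runtime-first is a classic heuristic for reducing
    # the makespan of independent tasks.
    ordered = sorted(tasks, key=predict_runtime, reverse=True)
    heap = [(0.0, core) for core in range(num_cores)]  # (accumulated load, core id)
    heapq.heapify(heap)
    assignment = {}
    for task in ordered:
        load, core = heapq.heappop(heap)               # least-loaded core so far
        assignment[task] = core
        heapq.heappush(heap, (load + predict_runtime(task), core))
    return assignment

# Example usage with a stand-in predictor (all values made up).
if __name__ == "__main__":
    tasks = ["fft", "sort", "blur", "reduce"]
    est = {"fft": 8.0, "sort": 3.0, "blur": 5.0, "reduce": 1.0}
    print(schedule(tasks, est.get, num_cores=2))
```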

Citations: cited by 44 publications (24 citation statements)
References: 38 publications
“…Li et al. [26], similarly to our work, use ANNs as models for online performance prediction, which they apply to task partitioning and scheduling for HPC clusters. Their work does not consider NUMA architectures, core and data layout effects, or power metrics.…”
Section: Related Work
confidence: 90%
“…Li et al. [26] and Ipek et al. [27] use artificial neural networks (ANNs) as black-box models for microarchitectural design space exploration. Li et al. [26], similarly to our work, use ANNs as models for online performance prediction, which they apply to task partitioning and scheduling for HPC clusters.…”
Section: Related Work
confidence: 99%
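As a rough illustration of the ANN-based online performance prediction these excerpts describe, the sketch below incrementally trains a small multilayer perceptron on observed (features, runtime) pairs and queries it for new tasks. It is not the authors' implementation; the feature layout and network size are assumptions.

```python
# Illustrative only: a small MLP used as an online task-runtime predictor.
import numpy as np
from sklearn.neural_network import MLPRegressor

class OnlinePredictor:
    def __init__(self, n_features):
        # partial_fit requires an SGD-style solver such as 'adam'.
        self.model = MLPRegressor(hidden_layer_sizes=(16,), solver="adam",
                                  learning_rate_init=0.01, random_state=0)
        self.n_features = n_features
        self.ready = False

    def observe(self, features, measured_runtime):
        """Update the model with one completed task's measurements."""
        X = np.asarray(features, dtype=float).reshape(1, self.n_features)
        y = np.asarray([measured_runtime], dtype=float)
        self.model.partial_fit(X, y)
        self.ready = True

    def predict(self, features):
        """Estimate a task's runtime; fall back to a default before any data."""
        if not self.ready:
            return 1.0
        X = np.asarray(features, dtype=float).reshape(1, self.n_features)
        return float(self.model.predict(X)[0])

# Example: learn from two observed tasks, then predict a new one.
# The feature layout [input size, cores used] is a made-up placeholder.
p = OnlinePredictor(n_features=2)
p.observe([1_000_000, 4], measured_runtime=2.1)
p.observe([2_000_000, 4], measured_runtime=4.3)
print(p.predict([1_500_000, 4]))
```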
“…They measured assignments, branches, and loops at run-time using dynamic analysis of the program. In [22], non-deterministic features were measured. The variables ar, ot were represented using performance counters: the number of CPU cycles, the number of cache misses and cache accesses for the last cache level, and the number of level-one cache hits.…”
Section: Data Generation and Model Selection
confidence: 99%
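The counters listed in this excerpt (CPU cycles, last-level cache accesses and misses, level-one cache hits) can be sampled on Linux with the perf CLI. The sketch below is only a hedged illustration, not the cited paper's tooling: event names and the CSV field layout of perf stat -x, vary across kernel versions and CPUs.

```python
# Hedged illustration: collect counter features with Linux `perf stat`.
import subprocess

EVENTS = ["cycles", "cache-references", "cache-misses",
          "L1-dcache-loads", "L1-dcache-load-misses"]

def measure(cmd):
    """Run `cmd` under perf and return raw counter values by event name."""
    perf = ["perf", "stat", "-x,", "-e", ",".join(EVENTS), "--"] + cmd
    proc = subprocess.run(perf, capture_output=True, text=True)
    counts = {}
    # perf writes CSV lines to stderr: value,unit,event,...
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counts[fields[2].strip()] = int(fields[0])
        # non-numeric first field means the counter was not supported/counted
    return counts

if __name__ == "__main__":
    c = measure(["sleep", "0.1"])
    # Derive a feature vector; L1 hits approximated as loads minus load misses.
    l1_hits = c.get("L1-dcache-loads", 0) - c.get("L1-dcache-load-misses", 0)
    features = [c.get("cycles", 0), c.get("cache-references", 0),
                c.get("cache-misses", 0), l1_hits]
    print(features)
```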
“…In supervised learning, on the other hand, the algorithm is provided with data that capture the relationship between the input and output variables of interest. These data are then given to a machine learning algorithm that builds a model.…”
[A flattened table fragment in this excerpt lists references grouped by prediction target: best tuning parameters, work distribution, power, and output size.]
Section: Machine Learning For Performance Modeling
confidence: 99%
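To make the supervised-learning description concrete, here is a generic fit/predict sketch: labelled (feature, runtime) pairs train a regressor that then predicts runtimes for unseen configurations. The data and model choice are placeholders, not the survey's or the paper's pipeline.

```python
# Generic supervised-learning sketch: labelled examples in, predictive model out.
# The features and runtimes below are fabricated placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Each row: [input size, thread count]; target: measured runtime in seconds.
X = np.array([[1e6, 1], [1e6, 2], [1e6, 4], [4e6, 1], [4e6, 2], [4e6, 4],
              [8e6, 1], [8e6, 2], [8e6, 4], [2e6, 2]])
y = np.array([2.0, 1.1, 0.7, 8.1, 4.3, 2.4, 16.0, 8.4, 4.6, 2.2])

# Hold out a test split to check that the learned model generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

print("held-out predictions:", model.predict(X_test))
print("held-out actuals:    ", y_test)
```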