2018 · DOI: 10.1002/cpe.4958
Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution

Abstract: Self-adaptability is a highly preferred feature in HPC applications. A crucial building block of a self-adaptable application is a data partitioning algorithm that must possess several essential qualities apart from low runtime and memory costs. On modern platforms composed of multicore CPU processors, data partitioning algorithms striving to solve the bi-objective optimization problem for performance and energy (BOPPE) face a formidable challenge. They must take into account the new complexities inherent in t…

Cited by 11 publications (20 citation statements)
References 61 publications (129 reference statements)
“…The variations are caused by the inherent complexities in modern multicore CPU platforms such as resource contention for shared resources on-chip such as last level cache (LLC) and interconnect. References [32,44,45] demonstrate by executing real-life multi-threaded data-parallel applications on modern multicore CPUs that the functional relationships between performance and workload size and between energy and workload size have complex (non-linear) properties. (Appendix E Tables A1-A3) present the statistics of prediction error between RAPL and HCLWattsUp on HCLServer03.…”
Section: Experimental Results on HCLServer03 (mentioning)
confidence: 99%
“…Severe resource contention due to tight integration of tens of cores organized in multiple sockets with multi-level cache hierarchy and contending for shared on-chip resources such as last level cache (LLC), interconnect (For example, Intel's Quick Path Interconnect, AMD's Hyper Transport), and DRAM controllers; b) Non-uniform memory access (NUMA) where the time for memory access between a core and main memory is not uniform and where main memory is distributed between locality domains or groups called NUMA nodes; and c) Dynamic power management (DPM) of multiple power domains (CPU sockets, DRAM). Lastovetsky and Reddy [8], Reddy and Lastovetsky [61] propose data partitioning algorithms that solve singleobjective optimization problems of data-parallel applications for performance or energy on homogeneous clusters of multicore CPUs. They take as an input, discrete performance and dynamic energy functions with no shape assumptions that accurately and realistically account for resource contention and NUMA inherent in modern multicore CPU platforms.…”
Section: B. Overview of Application-Level Techniques (mentioning)
confidence: 99%
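The statement above describes partitioning algorithms that take as input discrete, empirically measured performance functions with no shape assumptions. A minimal illustrative sketch of that idea, with hypothetical processor names and speed values (not taken from the paper): given each processor's measured speed at a few discrete workload sizes, exhaustively pick the distribution of the total workload that minimizes the parallel (max) execution time.

```python
from itertools import product

# Hypothetical discrete speed functions: workload size -> speed (units/s),
# measured empirically; no smoothness or monotonicity is assumed.
speed = {
    "cpu0": {2: 4.0, 4: 6.0, 6: 5.0, 8: 7.0},
    "cpu1": {2: 3.0, 4: 5.5, 6: 6.5, 8: 6.0},
}

def exec_time(proc, x):
    # Execution time at a measured point: workload / speed.
    return x / speed[proc][x] if x else 0.0

def optimal_partition(total):
    """Enumerate per-processor workloads (from the measured sizes, or 0)
    that sum to `total`, and return the one minimizing the max time."""
    procs = list(speed)
    sizes = [list(speed[p]) + [0] for p in procs]
    best = None
    for combo in product(*sizes):
        if sum(combo) != total:
            continue
        t = max(exec_time(p, x) for p, x in zip(procs, combo))
        if best is None or t < best[0]:
            best = (t, dict(zip(procs, combo)))
    return best

t, dist = optimal_partition(8)  # balanced split (4, 4) wins here
```

Real algorithms of this kind avoid the exponential enumeration (e.g. via dynamic programming over the discrete points); the brute force above only illustrates the objective.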
“…The cost of building the full speed functions of the abstract processors can be expensive. To reduce the cost, one approach is to build partial speed functions that are input to HiPOPTA to output optimal workload distribution for the specific input speed functions [7], [17].…”
Section: HiPOPTA: Hierarchical Two-Level Data Partitioning Algorithm Solving HiPOPT (mentioning)
confidence: 99%
“…This is because, for each data point, statistical averaging is performed, which involves multiple runs of the application to determine the sample means for the execution times. To reduce the cost, one approach is to build partial speed functions [7], [17], which are input to HiPOPTA to output optimal workload distribution for the specific input speed functions.…”
Section: B. Speed/Performance Functions of the Applications (mentioning)
confidence: 99%
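The statement above notes that each data point of a speed function is obtained by statistical averaging over repeated runs, which is what makes building full speed functions expensive. A hedged sketch of that measurement loop (the noise model, thresholds, and `run_app` stand-in are invented for illustration, not the paper's methodology): keep re-running until the standard error of the sample mean drops below a relative precision target.

```python
import random
import statistics

random.seed(0)

def run_app(workload):
    # Stand-in for executing the real data-parallel kernel once;
    # returns a noisy execution time (hypothetical 5% Gaussian noise).
    return workload * 0.1 * (1 + random.gauss(0, 0.05))

def mean_exec_time(workload, min_runs=5, max_runs=30, rel_precision=0.02):
    """Repeat runs until the standard error of the sample mean falls
    below rel_precision * mean, or max_runs is reached."""
    times = [run_app(workload) for _ in range(min_runs)]
    while len(times) < max_runs:
        m = statistics.mean(times)
        se = statistics.stdev(times) / len(times) ** 0.5
        if se <= rel_precision * m:
            break
        times.append(run_app(workload))
    return statistics.mean(times)

# One point of the speed function: speed = workload / mean execution time.
speed_point = 64 / mean_exec_time(64)
```

Building a *partial* speed function, as the citing papers describe, amounts to evaluating such points only at the workload sizes the partitioning algorithm actually queries, rather than over the full range.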