2018 · DOI: 10.1002/cpe.4958
Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution

Abstract: Self-adaptability is a highly preferred feature in HPC applications. A crucial building block of a self-adaptable application is a data partitioning algorithm that must possess several essential qualities apart from low runtime and memory costs. On modern platforms composed of multicore CPU processors, data partitioning algorithms striving to solve the bi-objective optimization problem for performance and energy (BOPPE) face a formidable challenge. They must take into account the new complexities inherent in t…

Cited by 11 publications (20 citation statements)
References 61 publications (129 reference statements)
“…The variations are caused by the inherent complexities in modern multicore CPU platforms such as resource contention for shared resources on-chip such as last level cache (LLC) and interconnect. References [32,44,45] demonstrate by executing real-life multi-threaded data-parallel applications on modern multicore CPUs that the functional relationships between performance and workload size and between energy and workload size have complex (non-linear) properties. (Appendix E Tables A1-A3) present the statistics of prediction error between RAPL and HCLWattsUp on HCLServer03.…”
Section: Experimental Results on HCLServer03 (mentioning)
confidence: 99%
“…Severe resource contention due to tight integration of tens of cores organized in multiple sockets with multi-level cache hierarchy and contending for shared on-chip resources such as last level cache (LLC), interconnect (For example, Intel's Quick Path Interconnect, AMD's Hyper Transport), and DRAM controllers; b) Non-uniform memory access (NUMA) where the time for memory access between a core and main memory is not uniform and where main memory is distributed between locality domains or groups called NUMA nodes; and c) Dynamic power management (DPM) of multiple power domains (CPU sockets, DRAM). Lastovetsky and Reddy [8], Reddy and Lastovetsky [61] propose data partitioning algorithms that solve singleobjective optimization problems of data-parallel applications for performance or energy on homogeneous clusters of multicore CPUs. They take as an input, discrete performance and dynamic energy functions with no shape assumptions that accurately and realistically account for resource contention and NUMA inherent in modern multicore CPU platforms.…”
Section: B. Overview of Application-Level Techniques (mentioning)
confidence: 99%
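The statement above describes partitioning algorithms that take as input discrete, empirically measured performance functions with no shape assumptions. A minimal illustrative sketch of that idea, with hypothetical processor names and speed values (not taken from the paper): given each processor's measured speed at a few discrete workload sizes, exhaustively pick the distribution of the total workload that minimizes the parallel (max) execution time.

```python
from itertools import product

# Hypothetical discrete speed functions: workload size -> speed (units/s),
# measured empirically; no smoothness or monotonicity is assumed.
speed = {
    "cpu0": {2: 4.0, 4: 6.0, 6: 5.0, 8: 7.0},
    "cpu1": {2: 3.0, 4: 5.5, 6: 6.5, 8: 6.0},
}

def exec_time(proc, x):
    # Execution time at a measured point: workload / speed.
    return x / speed[proc][x] if x else 0.0

def optimal_partition(total):
    """Enumerate per-processor workloads (from the measured sizes, or 0)
    that sum to `total`, and return the one minimizing the max time."""
    procs = list(speed)
    sizes = [list(speed[p]) + [0] for p in procs]
    best = None
    for combo in product(*sizes):
        if sum(combo) != total:
            continue
        t = max(exec_time(p, x) for p, x in zip(procs, combo))
        if best is None or t < best[0]:
            best = (t, dict(zip(procs, combo)))
    return best

t, dist = optimal_partition(8)  # balanced split (4, 4) wins here
```

Real algorithms of this kind avoid the exponential enumeration (e.g. via dynamic programming over the discrete points); the brute force above only illustrates the objective.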
“…The cost of building the full speed functions of the abstract processors can be expensive. To reduce the cost, one approach is to build partial speed functions that are input to HiPOPTA to output optimal workload distribution for the specific input speed functions [7], [17].…”
Section: HiPOPTA: Hierarchical Two-Level Data Partitioning Algorithm Solving HiPOPT (mentioning)
confidence: 99%
“…This is because, for each data point, statistical averaging is performed, which involves multiple runs of the application to determine the sample means for the execution times. To reduce the cost, one approach is to build partial speed functions [7], [17], which are input to HiPOPTA to output optimal workload distribution for the specific input speed functions.…”
Section: B. Speed/Performance Functions of the Applications (mentioning)
confidence: 99%
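The statement above notes that each data point of a speed function is obtained by statistical averaging over repeated runs, which is what makes building full speed functions expensive. A hedged sketch of that measurement loop (the noise model, thresholds, and `run_app` stand-in are invented for illustration, not the paper's methodology): keep re-running until the standard error of the sample mean drops below a relative precision target.

```python
import random
import statistics

random.seed(0)

def run_app(workload):
    # Stand-in for executing the real data-parallel kernel once;
    # returns a noisy execution time (hypothetical 5% Gaussian noise).
    return workload * 0.1 * (1 + random.gauss(0, 0.05))

def mean_exec_time(workload, min_runs=5, max_runs=30, rel_precision=0.02):
    """Repeat runs until the standard error of the sample mean falls
    below rel_precision * mean, or max_runs is reached."""
    times = [run_app(workload) for _ in range(min_runs)]
    while len(times) < max_runs:
        m = statistics.mean(times)
        se = statistics.stdev(times) / len(times) ** 0.5
        if se <= rel_precision * m:
            break
        times.append(run_app(workload))
    return statistics.mean(times)

# One point of the speed function: speed = workload / mean execution time.
speed_point = 64 / mean_exec_time(64)
```

Building a *partial* speed function, as the citing papers describe, amounts to evaluating such points only at the workload sizes the partitioning algorithm actually queries, rather than over the full range.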