On the support of task-parallel algorithmic skeletons for multi-GPU computing

Alexandre, Fernando; Marques, Ricardo Manuel Fernandes; Paulino, Hervé

doi:10.1145/2554850.2555018

Cited by 7 publications

(5 citation statements)

References 11 publications

“…The experiment is repeated for different input data sizes. The same experiments are conducted with the existing Marrow framework [6] [18], and the obtained results are compared with the proposed workload division strategy. The Marrow framework uses OpenCL computations on both CPU and GPUs on the heterogeneous cluster and static workload division strategy to distribute the workload manually based on the CPU and GPUs' hardware configuration.…”

Section: Linpack Benchmark Application (mentioning)

confidence: 99%

Performance Driven Analytical Workload Division Model for the HPC Applications on CPU-GPU Heterogeneous Cluster

N, A, Murthy et al. 2022. Preprint.

High-performance computing (HPC) is the cornerstone of many scientific and industrial innovations, and the demand for computing power is one of the driving factors behind innovation in computer hardware. In a hybrid system, CPUs and GPUs are combined to deliver better performance when executing HPC applications. The critical challenge in a heterogeneous cluster is the efficient distribution of the workload among the CPUs and GPUs in the nodes. To address this issue, this work develops an optimized analytical workload division model that considers the workload, the processing capabilities, and the number of CPUs and GPUs in the cluster. To address the inter-node and intra-node communication challenge, a pinned memory technique is used along with a single MPI process per node and CUDA IPC. The proposed strategy is validated through extensive experiments with the HPL and merge sort benchmark applications. Experiments show that the proposed workload division strategy performs much better than existing works.

“…We establish this order relation for both integer and floating-point arithmetic by running the SHOC benchmark suite [20] at the framework's installation-time. This static approach, although simple, delivers good performance results for GPU-accelerated executions [9], due to the specialized nature of the underlying execution model: one kernel execution at a time, with no preemption and no input/output operations. These premises are not valid for CPU executions.…”

Section: Workload Distribution (mentioning)

confidence: 99%

“…We claim that these characteristics can be used to: (a) hide the heterogeneity of the underlying hardware and, (b) provide tools to cope with such heterogeneity, enabling device-specific problem decompositions and optimizations. To that extent, we have been developing the Marrow algorithmic skeleton framework [8,9,10] for the orchestration of OpenCL computations. Marrow offers both data and task-parallel skeletons and is the first framework on the GPU computing field to support skeleton composition, through nesting.…”

Section: Introduction (mentioning)

confidence: 99%

Execution of compound multi‐kernel OpenCL computations in multi‐CPU/multi‐GPU environments

Soldado, Alexandre, Paulino. 2015. Concurrency and Computation.

Self Cite

Summary: Current computational systems are heterogeneous by nature, featuring a combination of CPUs and graphics processing units (GPUs). As the latter are becoming an established platform for high-performance computing, the focus is shifting towards the seamless programming of these hybrid systems as a whole. The distinct nature of the architectural and execution models in place raises several challenges, as the best hardware configuration is behavior and workload dependent. In this paper, we address the execution of compound, multi-kernel OpenCL computations in multi-CPU/multi-GPU environments. We address how these computations may be efficiently scheduled onto the target hardware, and how the system may adapt itself to changes in the workload to process and to fluctuations in the CPU's load. An experimental evaluation attests to the performance gains obtained by the conjoined use of the CPU and GPU devices, when compared with GPU-only executions, and also by the use of data-locality optimizations in CPU environments.

“…In [9] we addressed this issue for heterogeneous multi-GPU environments. The workload is statically distributed among the devices, according to their relative performance.…”

Section: Work-load Distribution (mentioning)

confidence: 99%
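As a rough illustration of the static scheme described in the statement above, the sketch below splits a workload among devices in proportion to a relative performance score. It is not taken from Marrow: the function name `static_partition` and the example scores are hypothetical, and the rounding rule (largest remainder) is an assumption made only so that the shares always add up to the full workload.

```python
from typing import List, Sequence


def static_partition(total_items: int, relative_perf: Sequence[float]) -> List[int]:
    """Split total_items among devices in proportion to relative_perf.

    relative_perf holds one positive score per device (hypothetical values,
    e.g. obtained from an installation-time benchmark run).
    """
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if not relative_perf or any(p <= 0 for p in relative_perf):
        raise ValueError("relative_perf must contain only positive scores")

    total_perf = sum(relative_perf)
    # Ideal (fractional) share of each device.
    exact = [total_items * p / total_perf for p in relative_perf]
    # Round every share down, then hand out the leftover items one by one
    # to the devices with the largest fractional remainders.
    counts = [int(x) for x in exact]
    leftover = total_items - sum(counts)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Two devices, the first assumed to be three times faster than the second.
    print(static_partition(1000, [3.0, 1.0]))
```

Because the split is computed once from fixed scores, it cannot react to load changes at run time, which is the limitation the citing papers go on to address for CPU executions.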

“…We claim that these characteristics can be used to, on one hand, hide the heterogeneity of the underlying hardware and, on the other, provide tools to cope with such heterogeneity, enabling device-specific parallel decompositions and optimizations. To that extent, we have been developing an algorithmic skeleton framework, entitled Marrow [8,9], for the orchestration of OpenCL kernels. Marrow offers both data and task-parallel skeletons, and is the first on the GPU computing field to support skeleton composition, through nesting.…”

Section: Introduction (mentioning)

confidence: 99%

Towards the Transparent Execution of Compound OpenCL Computations in Multi-CPU/Multi-GPU Environments

Soldado, Alexandre, Paulino. 2014. Lecture Notes in Computer Science.

Self Cite

Current computational systems are heterogeneous by nature, featuring a combination of CPUs and GPUs. As the latter are becoming an established platform for high-performance computing, the focus is shifting towards the seamless programming of the heterogeneous systems as a whole. The distinct nature of the architectural and execution models in place raises several challenges, as the best hardware configuration is behavior and data-set dependent. In this paper, we focus on the execution of compound computations in multi-CPU/multi-GPU environments, in the scope of the Marrow algorithmic skeleton framework, which is, to the best of our knowledge, the only one to support skeleton nesting in GPU computing. We address how these computations may be efficiently scheduled onto the target hardware, and how the system may adapt itself to changes in the CPU's load and in the input data-set.
