Exploitation of GPUs for the Parallelisation of Probably Parallel Legacy Code

Wang, Zheng; Powell, Daniel; Franke, Björn; O’Boyle, Michael

doi:10.1007/978-3-642-54807-9_9

Cited by 16 publications

(11 citation statements)

References 28 publications


“…In all preceding approaches, the code always gets executed on the GPU. Prior work on automatic generation of parallel GPU code from sequential programs also includes Par4ALL [Amini et al 2012], PPCG [Verdoolaege et al 2013], and that of Wang et al [2014a]. Unlike our approach, they do not consider the problem of selecting the most suitable device from the host CPU and the GPU to run the code.…”

Section: Related Work (mentioning)

confidence: 91%

Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems

Wang

Grewe

O’Boyle

2014

ACM Trans. Archit. Code Optim.

Self Cite


General-purpose GPU-based systems are highly attractive, as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to automatically generate optimized OpenCL code from data parallel OpenMP programs for GPUs. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses automatic machine learning to build a predictive model to determine if it is worthwhile running the OpenCL code on the GPU or OpenMP code on the multicore host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on distinct GPU-based systems. We achieved average (up to) speedups of 4.51× and 4.20× (143× and 67×) on Core i7/NVIDIA GeForce GTX580 and Core i7/AMD Radeon 7970 platforms, respectively, over a sequential baseline. Our approach achieves, on average, greater than 10× speedups over two state-of-the-art automatic GPU code generators.



“…The Open-MPC compiler [69] translates OpenMP to CUDA programs. Wang et al [20], [24], [70] translates OpenMP to OpenCL programs and use machine learning to select the most suitable device from the host CPU and the GPU to run the code. Rawat et al presents an automatic approach to generate GPU code from a domain-specific language (DSL) for stencil programs [71].…”

Section: Domain-specific Optimizations (mentioning)

confidence: 99%

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

Zhang

Fang

Yang

et al. 2020

IEEE Trans. Parallel Distrib. Syst.

Self Cite


As many-core accelerators keep integrating more processing units, it becomes increasingly difficult for a parallel application to make effective use of all available resources. An effective way to improve hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks, a strategy known as heterogeneous streaming. Achieving effective heterogeneous streaming requires carefully partitioning hardware among tasks and matching the granularity of task parallelism to the resource partition. However, finding the right resource partitioning and task granularity is extremely challenging, because there is a large number of possible solutions and the optimal solution varies across programs and datasets. This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
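The abstract above describes using a performance model as a utility to search for a good (resource partition, task granularity) configuration. The following is a minimal sketch of that idea only, not the authors' method: the exhaustive search, the function names, and the `toy_predict` scoring function are all hypothetical stand-ins, since the abstract specifies neither the search strategy nor the form of the learned model.

```python
from itertools import product


def search_best_config(predict, partitions, granularities):
    """Return the (partition, granularity) pair with the highest predicted score.

    `predict` stands in for a learned performance model (assumption);
    the exhaustive scan over all pairs is also an assumption.
    """
    best = None
    for partition, granularity in product(partitions, granularities):
        score = predict(partition, granularity)
        if best is None or score > best[0]:
            best = (score, partition, granularity)
    return best[1], best[2]


def toy_predict(partition, granularity):
    # Hypothetical scoring function, not a trained model: it simply
    # peaks at partition=4, granularity=16 for illustration.
    return -((partition - 4) ** 2) - ((granularity - 16) ** 2)


best_partition, best_granularity = search_best_config(
    toy_predict, [1, 2, 4, 8], [8, 16, 32]
)
```

Because the model is only queried, never executed on the hardware, each candidate costs one prediction rather than one profiling run, which is the property the abstract relies on for runtime use.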


“…Machine learning has been employed for various optimization tasks [40], including code optimization [7,12,29,30,37,39,41,42,43,44,45,46,51], task scheduling [9,10,11,33,34], model selection [38], etc.…”

Section: Related Work (mentioning)

confidence: 99%

Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture

Chen

Fang

Chen

et al. 2019

Int J Parallel Prog

Self Cite


Sparse matrix-vector multiplications (SpMV) are common in scientific and HPC applications but are hard to optimize. While the ARMv8-based processor IP is emerging as an alternative to the traditional x64 HPC processor design, there is little study on SpMV performance on such new many-cores. To design efficient HPC software and hardware, we need to understand how well SpMV performs. This work develops a quantitative approach to characterize SpMV performance on a recent ARMv8-based many-core architecture, Phytium FT-2000 Plus (FTP). We perform extensive experiments involving over 9,500 distinct profiling runs on 956 sparse datasets and five mainstream sparse matrix storage formats, and compare FTP against the Intel Knights Landing many-core. We experimentally show that picking the optimal sparse matrix storage format and parameters is non-trivial as the correct decision requires expert knowledge of the input matrix and the hardware. We address the problem by proposing a machine learning based model that predicts the best storage format and parameters using input matrix features. The model automatically specializes to the many-core architectures we considered. The experimental results show that our approach achieves on average 93% of the best-available performance without incurring runtime profiling overhead.
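To make the notion of a sparse matrix storage format concrete, here is a minimal sketch of SpMV over compressed sparse row (CSR) storage. CSR is chosen here as a widely used format; the abstract does not name the five formats it studies, so treating CSR as representative is an assumption, and the helper names are illustrative.

```python
def csr_from_dense(dense):
    """Build CSR arrays (values, col_idx, row_ptr) from a dense list of rows."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i's nonzeros end in `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x, touching only the stored nonzeros of A."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


dense = [[1, 0, 2],
         [0, 0, 3],
         [4, 5, 0]]
values, col_idx, row_ptr = csr_from_dense(dense)
y = spmv_csr(values, col_idx, row_ptr, [1, 2, 3])
```

The indirect access `x[col_idx[k]]` depends on where the nonzeros sit in each row, which is one reason the best storage format varies with the input matrix.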

