Abstract: General-purpose GPUs provide massive compute power but are notoriously difficult to program. In this paper, we present a complete compilation strategy to exploit GPUs for the parallelisation of sequential legacy code. Using hybrid data dependence analysis combining static and dynamic information, our compiler automatically detects suitable parallelism and generates parallel OpenCL code from sequential programs. We exploit the fact that dependence profiling provides us with parallel loop candidates that are hig…
“…In all preceding approaches, the code always gets executed on the GPU. Prior work on automatic generation of parallel GPU code from sequential programs also includes Par4ALL [Amini et al 2012], PPCG [Verdoolaege et al 2013], and that of Wang et al [2014a]. Unlike our approach, they do not consider the problem of selecting the most suitable device from the host CPU and the GPU to run the code.…”
General-purpose GPU-based systems are highly attractive, as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to automatically generate optimized OpenCL code from data parallel OpenMP programs for GPUs. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses automatic machine learning to build a predictive model to determine if it is worthwhile running the OpenCL code on the GPU or OpenMP code on the multicore host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on distinct GPU-based systems. We achieved average (up to) speedups of 4.51× and 4.20× (143× and 67×) on Core i7/NVIDIA GeForce GTX580 and Core i7/AMD Radeon 7970 platforms, respectively, over a sequential baseline. Our approach achieves, on average, greater than 10× speedups over two state-of-the-art automatic GPU code generators.
“…The Open-MPC compiler [69] translates OpenMP to CUDA programs. Wang et al [20], [24], [70] translate OpenMP to OpenCL programs and use machine learning to select the most suitable device from the host CPU and the GPU to run the code. Rawat et al present an automatic approach to generate GPU code from a domain-specific language (DSL) for stencil programs [71].…”
As many-core accelerators keep integrating more processing units, it becomes increasingly difficult for a parallel application to make effective use of all available resources. An effective way to improve hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks, a strategy known as heterogeneous streaming. Achieving effective heterogeneous streaming requires carefully partitioning hardware among tasks and matching the granularity of task parallelism to the resource partition. However, finding the right resource partitioning and task granularity is extremely challenging, because there are a large number of possible solutions and the optimal solution varies across programs and datasets. This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
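The model-guided search described in this abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: `predicted_speedup` is a toy stand-in for the learned performance model (the abstract does not give its form or features), and the constraint that a configuration needs at least one task per partition is hypothetical.

```python
from itertools import product


def predicted_speedup(partitions, tasks):
    """Toy stand-in for a learned performance model (hypothetical formula)."""
    # Assumed shape: overlapping work across streams helps, but each
    # extra partition and task adds some management overhead.
    overlap_gain = 1.0 + 0.5 * min(partitions, tasks) ** 0.5
    overhead = 0.02 * partitions + 0.005 * tasks
    return overlap_gain - overhead


def search_best_config(model, partition_choices, task_choices):
    """Return the (partitions, tasks) pair with the highest predicted score."""
    # Hypothetical validity rule: at least one task per resource partition.
    candidates = [
        (p, t) for p, t in product(partition_choices, task_choices) if t >= p
    ]
    return max(candidates, key=lambda cfg: model(*cfg))


if __name__ == "__main__":
    best = search_best_config(predicted_speedup, [1, 2, 4, 8], [1, 2, 4, 8, 16, 32])
    print("chosen (partitions, tasks):", best)
```

Because the model is only queried, never executed on hardware, the search cost is a handful of cheap function calls per candidate, which is what makes this kind of search plausible at runtime.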
“…Machine learning has been employed for various optimization tasks [40], including code optimization [7,12,29,30,37,39,41,42,43,44,45,46,51], task scheduling [9,10,11,33,34], model selection [38], etc.…”
Sparse matrix-vector multiplications (SpMV) are common in scientific and HPC applications but are hard to optimize. While the ARMv8-based processor IP is emerging as an alternative to the traditional x64 HPC processor design, there is little study on SpMV performance on such new many-cores. To design efficient HPC software and hardware, we need to understand how well SpMV performs. This work develops a quantitative approach to characterize SpMV performance on a recent ARMv8-based many-core architecture, Phytium FT-2000 Plus (FTP). We perform extensive experiments involving over 9,500 distinct profiling runs on 956 sparse datasets and five mainstream sparse matrix storage formats, and compare FTP against the Intel Knights Landing many-core. We experimentally show that picking the optimal sparse matrix storage format and parameters is non-trivial, as the correct decision requires expert knowledge of the input matrix and the hardware. We address the problem by proposing a machine learning based model that predicts the best storage format and parameters using input matrix features. The model automatically specializes to the many-core architectures we considered. The experimental results show that our approach achieves on average 93% of the best-available performance without incurring runtime profiling overhead.
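To make the role of the storage format concrete, here is a minimal Python sketch of SpMV over the compressed sparse row (CSR) layout. CSR is chosen here only because it is widely used; the abstract does not name the five formats it evaluates, and the helper names `dense_to_csr` and `spmv_csr` are illustrative, not taken from the paper.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)   # non-zero value
                col_idx.append(j)  # its column index
        row_ptr.append(len(values))  # where the next row starts in `values`
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x where A is given in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        # Only the stored non-zeros of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


if __name__ == "__main__":
    A = [[1, 0, 2],
         [0, 3, 0],
         [4, 0, 5]]
    values, col_idx, row_ptr = dense_to_csr(A)
    print(spmv_csr(values, col_idx, row_ptr, [1, 1, 1]))  # prints [3, 3, 9]
```

The indirect access `x[col_idx[k]]` is one reason the best format depends on the matrix: the memory access pattern follows the sparsity structure, so different layouts suit different inputs and hardware.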