A Survey of CPU-GPU Heterogeneous Computing Techniques

Mittal, Sparsh; Vetter, Jeffrey S.

doi:10.1145/2788396

Cited by 424 publications

(214 citation statements)

References 155 publications

Supporting

Mentioning: 196

Contrasting

Unclassified

“…While the word has been loosely used earlier, this study has provided guidance on how we should think about components, and thus, this definition will be followed for the rest of the discussion in this section. Fine-grained parallelism approaches, such as those surveyed in Mittal and Vetter (2015), may be applied within a component as so defined, but are likely to fail above that level. A substantial increase in overall scalability of an ESM may be achieved if several components are run concurrently.…”

Section: Discussion (mentioning)

confidence: 99%

Coarse-grained component concurrency in Earth system modeling: parallelizing atmospheric radiative transfer in the GFDL AM3 model using the Flexible Modeling System coupling framework

et al. 2016

Abstract. Climate models represent a large variety of processes on a variety of timescales and space scales, a canonical example of multi-physics multi-scale modeling. Current hardware trends, such as Graphical Processing Units (GPUs) and Many Integrated Core (MIC) chips, are based on, at best, marginal increases in clock speed, coupled with vast increases in concurrency, particularly at the fine grain. Multi-physics codes face particular challenges in achieving fine-grained concurrency, as different physics and dynamics components have different computational profiles, and universal solutions are hard to come by. We propose here one approach for multi-physics codes. These codes are typically structured as components interacting via software frameworks. The component structure of a typical Earth system model consists of a hierarchical and recursive tree of components, each representing a different climate process or dynamical system. This recursive structure generally encompasses a modest level of concurrency at the highest level (e.g., atmosphere and ocean on different processor sets) with serial organization underneath. We propose to extend concurrency much further by running more and more lower- and higher-level components in parallel with each other. Each component can further be parallelized on the fine grain, potentially offering a major increase in the scalability of Earth system models. We present here first results from this approach, called coarse-grained component concurrency, or CCC. Within the Geophysical Fluid Dynamics Laboratory (GFDL) Flexible Modeling System (FMS), the atmospheric radiative transfer component has been configured to run in parallel with a composite component consisting of every other atmospheric component, including the atmospheric dynamics and all other atmospheric physics components. We will explore the algorithmic challenges involved in such an approach, and present results from such simulations. Plans to achieve even greater levels of coarse-grained concurrency by extending this approach within other components, such as the ocean, will be discussed.

“…OpenCL (Open Computing Language) is a recent standard, which is ratified by the Khronos Group, for cross-platform parallel programming with diverse processors [14]. OpenCL is welcomed for its portability, but it cannot achieve the highest possible performance for its high-level abstraction [15]. Brook [16] is an extension to the C-language for stream programming that was originally developed by Stanford University; Brook+ is an implementation of the Brook GPU specification on AMD's compute abstraction layer.…”

Section: Heterogeneous Computing (mentioning)

confidence: 99%

“…A scheduling strategy should consider both the internal characteristics of target algorithms and the external hardware attributes of the underlying PUs to determine suitable task partitioning and allocation. Currently, much research focuses on algorithm-level workload partitioning and scheduling [15]. Workload partitioning techniques have been designed based on the relative performance of PUs [20,24], the nature of subtasks [25], or other partitioning criteria for different algorithms and applications.…”

Section: Heterogeneous Computing (mentioning)

confidence: 99%

A Hybrid Parallel Spatial Interpolation Algorithm for Massive LiDAR Point Clouds on Heterogeneous CPU-GPU Systems

Wang, Guan 2017, IJGI

Nowadays, heterogeneous CPU-GPU systems have become ubiquitous, but current parallel spatial interpolation (SI) algorithms exploit only one type of processing unit, and thus result in a waste of parallel resources. To address this problem, a hybrid parallel SI algorithm based on a thin plate spline is proposed to integrate both the CPU and GPU to further accelerate the processing of massive LiDAR point clouds. A simple yet powerful parallel framework is designed to enable simultaneous CPU-GPU interpolation, and a fast online training method is then presented to estimate the optimal decomposition granularity so that both types of processing units can run at maximum speed. Based on the optimal granularity, massive point clouds are continuously partitioned into a collection of discrete blocks in a data processing flow. A heterogeneous dynamic scheduler based on the greedy policy is also proposed to achieve better workload balancing. Experimental results demonstrate that the computing power of the CPU and GPU is fully utilized under conditions of optimal granularity, and the hybrid parallel SI algorithm achieves a significant performance boost when compared with the CPU-only and GPU-only algorithms. For example, the hybrid algorithm achieved a speedup of 20.2 on one of the experimental point clouds, while the corresponding speedups of using a CPU or a GPU alone were 8.7 and 12.6, respectively. The interpolation time was reduced by about 12% when using the proposed scheduler, in comparison with other common scheduling strategies.

“…Similarly, in compute-intensive applications, while utilizing the accelerating device, the host CPUs remain idle, which leads to waste of energy and performance. Approaches that intelligently manage the resources of host CPUs and accelerating devices to address such inefficiencies seem promising [68]. To achieve higher performance, scalability and energy efficiency, engineers often combine Central Processing Units (CPUs), Graphical Processing Units (GPUs), or Field Programmable Gate Arrays (FPGAs).…”

Section: Introduction (mentioning)

confidence: 99%

Using meta-heuristics and machine learning for software optimization of parallel computing systems: a systematic literature review

et al. 2018

While modern parallel computing systems offer high performance, utilizing these powerful computing resources to the highest possible extent demands advanced knowledge of various hardware architectures and parallel programming models. Furthermore, optimized software execution on parallel computing systems demands consideration of many parameters at compile-time and run-time. Determining the optimal set of parameters in a given execution context is a complex task, and therefore to address this issue researchers have proposed different approaches that use heuristic search or machine learning. In this paper, we undertake a systematic literature review to aggregate, analyze and classify the existing software optimization methods for parallel computing systems. We review approaches that use machine learning or meta-heuristics for software optimization at compile-time and run-time. We discuss challenges and future research directions. The results of this study may help to better understand the state-of-the-art techniques that use machine learning and meta-heuristics to deal with the complexity of software optimization for parallel computing systems. Furthermore, it may aid in understanding the limitations of existing approaches and identification of areas for improvement.
