2018
DOI: 10.1007/s10766-018-0555-0
Heterogeneous parallel_for Template for CPU–GPU Chips

Cited by 19 publications (29 citation statements)
References 17 publications
“…To partition generic programs dynamically at runtime, several authors have shown that partitioning on a data level is a viable option for both regular and irregular problems. [26][27][28][29] Some works [30][31][32] tackle the problem of accelerating numerical algorithms such as matrix multiplication and the fast Fourier transform on heterogeneous systems using data partitioning approaches. In ABSs, however, we observe a strong locality of dependencies, as agents primarily interact with nearby agents, that is, their neighbors.…”
Section: Related Work
confidence: 99%
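To make the data-level partitioning idea in this statement concrete, here is a minimal C++ sketch of splitting an iteration space between an accelerator and the host cores by a fixed ratio. The kernels and the gpu_ratio parameter are hypothetical placeholders for illustration, not code from the cited works.

```cpp
#include <cstddef>
#include <cstdio>

// Placeholder kernels (hypothetical; the cited works use real device code).
static void process_on_gpu(std::size_t b, std::size_t e) {
    std::printf("GPU chunk: [%zu, %zu)\n", b, e);
}
static void process_on_cpu(std::size_t b, std::size_t e) {
    std::printf("CPU chunk: [%zu, %zu)\n", b, e);
}

// Static data-level partitioning: split the iteration space [0, n)
// by a fixed ratio between the accelerator and the host cores.
void partitioned_for(std::size_t n, double gpu_ratio) {
    std::size_t split = static_cast<std::size_t>(n * gpu_ratio);
    process_on_gpu(0, split);   // offloaded share
    process_on_cpu(split, n);   // remainder on the CPU
}

int main() { partitioned_for(1000, 0.7); }
```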
“…Vilches et al [49] developed a novel adaptive partitioning algorithm for parallel loops to find the appropriate chunk size for GPUs and CPUs. Navarro et al [50] also studied the partitioning strategy for parallel loops, especially for irregular applications on heterogeneous CPU-GPU architectures. Sakai et al [51] proposed a novel decomposition method that can execute single-GPU code on multi-GPU systems.…”
Section: Related Work
confidence: 99%
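The chunk-size search that [49] and [50] address can be illustrated with a generic adaptive loop: grow the chunk while measured throughput improves, and back off when it degrades. This is a simplified sketch under the assumption that per-chunk throughput is measurable, not the actual algorithm from either paper.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for launching one chunk on a device.
static void run_chunk(std::size_t begin, std::size_t end) {
    volatile double sink = 0;
    for (std::size_t i = begin; i < end; ++i) sink += i * 0.5;  // dummy work
}

// Generic adaptive chunk sizing (not the exact algorithm of [49] or [50]):
// double the chunk while throughput improves, otherwise halve it.
void adaptive_for(std::size_t n) {
    std::size_t chunk = 64, i = 0;
    double best = 0.0;
    while (i < n) {
        std::size_t end = std::min(n, i + chunk);
        auto t0 = std::chrono::steady_clock::now();
        run_chunk(i, end);
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        double throughput = (end - i) / (secs > 0 ? secs : 1e-9);
        if (throughput >= best) { best = throughput; chunk *= 2; }  // keep growing
        else chunk = std::max<std::size_t>(64, chunk / 2);          // back off
        i = end;
    }
    std::printf("final chunk size: %zu\n", chunk);
}

int main() { adaptive_for(1 << 20); }
```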
“…The user still has to provide the FPGA kernel, but the TBB-based run-time takes care of evenly partitioning the iteration space and making the data accessible to both the CPU and the FPGA. To do so, we select a state-of-the-art high-level scheduler called LogFit [19,7] that was recently developed for CPU+GPU chips, and we extend it to support simultaneous computing on CPU+FPGA. LogFit has also been used for Xilinx chips composed of ARM cores and an FPGA [20], but in that case, instead of OpenCL, SDSoC and C were used to generate the FPGA computing units for regular applications.…”
Section: Background and Related Work
confidence: 99%
“…This work addresses these challenges by combining OpenCL (to code the functions that can be executed on the FPGA) with Threading Building Blocks (TBB [6]) to orchestrate and distribute the iterations of a parallel_for among the cores and the FPGA. We have recently proposed a scheduling algorithm that dynamically distributes chunks of the parallel iteration space among CPU cores and a GPU [7]. To this end, the scheduler monitors the throughput of each computing unit during the execution of the iterations and uses this metric to adaptively resize the CPU and GPU chunks in order to optimize overall throughput and to prevent underutilization and load imbalance between the GPU and the CPU cores.…”
Section: Introduction
confidence: 99%
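The mechanism this statement describes (time each chunk, derive a throughput per computing unit, and resize subsequent chunks accordingly) can be sketched as follows. This is a simplified, single-threaded illustration of the idea, not the LogFit implementation; the kernels and the proportional resize policy are assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Hypothetical stand-ins for the real CPU and GPU kernels.
static void cpu_kernel(std::size_t b, std::size_t e) {
    volatile double s = 0;
    for (std::size_t i = b; i < e; ++i) s += i;
}
static void gpu_kernel(std::size_t b, std::size_t e) { cpu_kernel(b, e); }

// Run one chunk through `kernel` and return its throughput (items/second).
template <typename K>
static double timed_run(K kernel, std::size_t b, std::size_t e) {
    auto t0 = std::chrono::steady_clock::now();
    kernel(b, e);
    double s = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
    return (e - b) / (s > 0 ? s : 1e-9);
}

// Simplified scheduler loop: alternately hand chunks to the "GPU" and the
// "CPU", resizing the GPU chunk in proportion to the observed throughput
// ratio so the faster device receives proportionally more iterations.
void heterogeneous_for(std::size_t n) {
    std::size_t i = 0, gpu_chunk = 1024;
    const std::size_t cpu_chunk = 1024;
    while (i < n) {
        std::size_t ge = std::min(n, i + gpu_chunk);
        double gpu_tp = timed_run(gpu_kernel, i, ge);
        i = ge;
        if (i >= n) break;
        std::size_t ce = std::min(n, i + cpu_chunk);
        double cpu_tp = timed_run(cpu_kernel, i, ce);
        i = ce;
        // Adapt: rescale the GPU chunk by the measured throughput ratio.
        gpu_chunk = std::max<std::size_t>(
            256, static_cast<std::size_t>(cpu_chunk * (gpu_tp / cpu_tp)));
    }
}

int main() { heterogeneous_for(1 << 22); }
```

In the scheduler the statement describes, the CPU cores and the GPU execute their chunks concurrently; the sequential alternation above only captures the load-balancing intent of the throughput-driven resize.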