2018
DOI: 10.1007/s10766-018-0555-0
Heterogeneous parallel_for Template for CPU–GPU Chips

Cited by 19 publications (29 citation statements)
References 17 publications
“…To partition generic programs dynamically at runtime, several authors have shown that partitioning on a data level is a viable option for both regular and irregular problems. [26][27][28][29] Some works [30][31][32] tackle the problem of accelerating numerical algorithms such as matrix multiplication and the fast Fourier transform on heterogeneous systems using data partitioning approaches. In ABSs, however, we observe a strong locality of dependencies, as agents primarily interact with nearby agents, that is, their neighbors.…”
Section: Related Work
confidence: 99%
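To make the data-level partitioning idea in this statement concrete, here is a minimal C++ sketch of splitting an iteration space between an accelerator and the host cores by a fixed ratio. The kernels and the gpu_ratio parameter are hypothetical placeholders for illustration, not code from the cited works.

```cpp
#include <cstddef>
#include <cstdio>

// Placeholder kernels (hypothetical; the cited works use real device code).
static void process_on_gpu(std::size_t b, std::size_t e) {
    std::printf("GPU chunk: [%zu, %zu)\n", b, e);
}
static void process_on_cpu(std::size_t b, std::size_t e) {
    std::printf("CPU chunk: [%zu, %zu)\n", b, e);
}

// Static data-level partitioning: split the iteration space [0, n)
// by a fixed ratio between the accelerator and the host cores.
void partitioned_for(std::size_t n, double gpu_ratio) {
    std::size_t split = static_cast<std::size_t>(n * gpu_ratio);
    process_on_gpu(0, split);   // offloaded share
    process_on_cpu(split, n);   // remainder on the CPU
}

int main() { partitioned_for(1000, 0.7); }
```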
“…Vilches et al [49] developed a novel adaptive partitioning algorithm for parallel loops to find the appropriate chunk size for GPUs and CPUs. Navarro et al [50] also studied the partitioning strategy for parallel loops, especially for irregular applications on heterogeneous CPU-GPU architectures. Sakai et al [51] proposed a novel decomposition method that can execute single-GPU code on multi-GPU systems.…”
Section: Related Work
confidence: 99%
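The chunk-size search that [49] and [50] address can be illustrated with a generic adaptive loop: grow the chunk while measured throughput improves, and back off when it degrades. This is a simplified sketch under the assumption that per-chunk throughput is measurable, not the actual algorithm from either paper.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for launching one chunk on a device.
static void run_chunk(std::size_t begin, std::size_t end) {
    volatile double sink = 0;
    for (std::size_t i = begin; i < end; ++i) sink += i * 0.5;  // dummy work
}

// Generic adaptive chunk sizing (not the exact algorithm of [49] or [50]):
// double the chunk while throughput improves, otherwise halve it.
void adaptive_for(std::size_t n) {
    std::size_t chunk = 64, i = 0;
    double best = 0.0;
    while (i < n) {
        std::size_t end = std::min(n, i + chunk);
        auto t0 = std::chrono::steady_clock::now();
        run_chunk(i, end);
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        double throughput = (end - i) / (secs > 0 ? secs : 1e-9);
        if (throughput >= best) { best = throughput; chunk *= 2; }  // keep growing
        else chunk = std::max<std::size_t>(64, chunk / 2);          // back off
        i = end;
    }
    std::printf("final chunk size: %zu\n", chunk);
}

int main() { adaptive_for(1 << 20); }
```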
“…The user still has to provide the FPGA kernel, but the TBB-based run-time takes care of evenly partitioning the iteration space and making the data accessible to both the CPU and the FPGA. To do so, we select a state-of-the-art high-level scheduler called LogFit [19,7] that was recently developed for CPU+GPU chips, and we extend it to support simultaneous computing on CPU+FPGA. LogFit has also been used for Xilinx chips composed of ARM cores and an FPGA [20], but in that case, instead of OpenCL, SDSoC and C were used to generate the FPGA computing units for regular applications.…”
Section: Background and Related Work
confidence: 99%
“…This work addresses these challenges by combining OpenCL (to code the functions that can be executed on the FPGA) with Threading Building Blocks (TBB [6]) to orchestrate and distribute the iterations of a parallel_for among the cores and the FPGA. We have recently proposed a scheduling algorithm that dynamically distributes chunks of the parallel iteration space among CPU cores and a GPU [7]. To this end, the scheduler monitors the throughput of each computing unit during the execution of the iterations and uses this metric to adaptively resize the CPU and GPU chunks in order to optimize overall throughput and to prevent underutilization and load imbalance between the GPU and the CPU cores.…”
Section: Introduction
confidence: 99%
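The mechanism this statement describes (time each chunk, derive a throughput per computing unit, and resize subsequent chunks accordingly) can be sketched as follows. This is a simplified, single-threaded illustration of the idea, not the LogFit implementation; the kernels and the proportional resize policy are assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Hypothetical stand-ins for the real CPU and GPU kernels.
static void cpu_kernel(std::size_t b, std::size_t e) {
    volatile double s = 0;
    for (std::size_t i = b; i < e; ++i) s += i;
}
static void gpu_kernel(std::size_t b, std::size_t e) { cpu_kernel(b, e); }

// Run one chunk through `kernel` and return its throughput (items/second).
template <typename K>
static double timed_run(K kernel, std::size_t b, std::size_t e) {
    auto t0 = std::chrono::steady_clock::now();
    kernel(b, e);
    double s = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
    return (e - b) / (s > 0 ? s : 1e-9);
}

// Simplified scheduler loop: alternately hand chunks to the "GPU" and the
// "CPU", resizing the GPU chunk in proportion to the observed throughput
// ratio so the faster device receives proportionally more iterations.
void heterogeneous_for(std::size_t n) {
    std::size_t i = 0, gpu_chunk = 1024;
    const std::size_t cpu_chunk = 1024;
    while (i < n) {
        std::size_t ge = std::min(n, i + gpu_chunk);
        double gpu_tp = timed_run(gpu_kernel, i, ge);
        i = ge;
        if (i >= n) break;
        std::size_t ce = std::min(n, i + cpu_chunk);
        double cpu_tp = timed_run(cpu_kernel, i, ce);
        i = ce;
        // Adapt: rescale the GPU chunk by the measured throughput ratio.
        gpu_chunk = std::max<std::size_t>(
            256, static_cast<std::size_t>(cpu_chunk * (gpu_tp / cpu_tp)));
    }
}

int main() { heterogeneous_for(1 << 22); }
```

In the scheduler the statement describes, the CPU cores and the GPU execute their chunks concurrently; the sequential alternation above only captures the load-balancing intent of the throughput-driven resize.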