2018
DOI: 10.1007/s11265-018-1416-1
Exploiting Task Parallelism with OpenCL: A Case Study

Abstract: While data parallelism aspects of OpenCL have been of primary interest due to the massively data parallel GPUs being on focus, OpenCL also provides powerful capabilities to describe task parallelism. In this article we study the task parallel concepts available in OpenCL and find out how well the different vendor-specific implementations can exploit task parallelism when the parallelism is described in various ways utilizing the command queues. We show that the vendor implementations are not yet capable of ext…


Cited by 11 publications (7 citation statements)
References 8 publications
“…Then, the last step of the profiling phase, Step C, is performed. At this point, the acceleration between both offloading modes is computed, determining the best strategy to be used with the device being profiled: Thus, values higher than 1.0 indicate that the device has a beneficial behavior when facing workload splitting strategies, allowing an increase in throughputs by taking advantage of multiple command queues, overlap between computation and communication as well as appropriate interleaving between management and computation, as demonstrated in previous studies [17,19,[21][22][23][24]. And therefore, values lower than 1.0 indicate that it suffers penalization for device management and chunk synchronization, sharing of CPU usage with the simulator itself or other tasks and even an indication of very short execution times, where the generation of multiple chunks is usually counterproductive.…”
Section: Mash Algorithm
Confidence: 99%
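The profiling decision quoted above reduces to comparing the throughput of the two offloading modes and picking the one whose ratio favors it. A minimal sketch of that logic, with hypothetical function and variable names (the cited paper does not publish this code):

```python
# Illustrative sketch of the Step C decision described above: the
# acceleration ratio between the workload-splitting mode and the
# single-offload mode selects the strategy for the profiled device.
# All names here are assumptions for illustration.

def choose_offloading_strategy(throughput_split: float,
                               throughput_single: float):
    """Return ("split", ratio) when splitting across multiple command
    queues paid off during profiling, ("single", ratio) otherwise."""
    ratio = throughput_split / throughput_single
    if ratio > 1.0:
        # Splitting overlapped communication with computation and
        # interleaved management with computation effectively.
        return "split", ratio
    # Management, chunk synchronization, or very short execution times
    # made multiple chunks counterproductive.
    return "single", ratio

print(choose_offloading_strategy(120.0, 100.0))  # ratio 1.2 -> "split"
print(choose_offloading_strategy(80.0, 100.0))   # ratio 0.8 -> "single"
```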
“…There is no explicit concept of a task graph in the standard, but instead the runtime can construct the graphs implicitly: The devices are controlled using a concept of a command queue, which supports event synchronization between the kernels pushed to the devices through them. The concepts of a kernel (a function executed on a device), a command queue (a way to push work to a device), and an event (signaled by a completion of a command that can be waited on) together can be used to form complex parallel heterogeneous computing across a diverse heterogeneous platform with multiple different device types (Jääskeläinen et al, 2019). Hahnfeld et al (2018) show that to get significant speedups from executing heterogeneous task graphs, the devices must be able to interoperate efficiently to interleave the communication with the computation.…”
Section: Diverse Heterogeneous Platform Software Layer
Confidence: 99%
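The passage above describes how an OpenCL runtime builds a task graph implicitly: commands pushed to command queues carry event wait lists, and a command's completion signals its own event. A toy model of that mechanism (plain Python, not an OpenCL binding; all class names are illustrative):

```python
# Toy model of implicit task-graph construction from command queues and
# events, as described above: an enqueued "kernel" records the events it
# waits on, and completing it signals its own event so that dependents
# on any queue become ready. Not real OpenCL API.

from collections import deque

class Event:
    def __init__(self):
        self.complete = False

class CommandQueue:
    def __init__(self, runtime, name):
        self.runtime, self.name = runtime, name

    def enqueue(self, kernel, wait_events=()):
        ev = Event()
        self.runtime.commands.append((self, kernel, tuple(wait_events), ev))
        return ev

class Runtime:
    def __init__(self):
        self.commands = []
        self.log = []

    def flush(self):
        pending = deque(self.commands)
        while pending:
            queue, kernel, waits, ev = pending.popleft()
            if all(w.complete for w in waits):
                kernel()                     # "execute" the kernel
                ev.complete = True           # signal completion event
                self.log.append((queue.name, kernel.__name__))
            else:
                pending.append((queue, kernel, waits, ev))

rt = Runtime()
q_gpu, q_cpu = CommandQueue(rt, "gpu"), CommandQueue(rt, "cpu")
def produce(): pass
def consume(): pass
e = q_gpu.enqueue(produce)
q_cpu.enqueue(consume, wait_events=[e])  # cross-queue dependency via event
rt.flush()
print(rt.log)
```

The cross-queue wait is the key point: the runtime never sees an explicit graph, yet the event dependency forces `consume` on the CPU queue to run only after `produce` on the GPU queue has signaled.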
“…Event dependencies are mapped to platform-local events on each server and events for commands running on other servers are substituted with user events. This way the heterogeneous task graph based on event dependencies defined by the application stays intact on the remote servers and the runtime can apply optimisations utilizing the dependency rules outlined in [12].…”
Section: Decentralized Command Scheduling
Confidence: 99%
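The substitution described in this citation — replacing events for commands running on other servers with locally created user events — can be sketched as follows. The `UserEvent` class mirrors the role of OpenCL's `clCreateUserEvent`/`clSetUserEventStatus` pair; the function and dictionary names are assumptions, not the actual runtime's API:

```python
# Hedged sketch of localizing a distributed event wait list: events owned
# by the local server are used as-is, while events owned by other servers
# are substituted with user-event proxies that get signaled when the
# remote completion notification arrives. Names are illustrative.

class UserEvent:
    """Locally created event whose status the runtime sets explicitly,
    playing the role of clCreateUserEvent / clSetUserEventStatus."""
    def __init__(self):
        self.complete = False
    def set_complete(self):
        self.complete = True

def localize_dependencies(wait_events, local_server, proxies):
    """Replace events owned by other servers with local user-event
    proxies, keeping the application's task graph intact locally.
    `wait_events` is a list of (owner_server, event) pairs."""
    localized = []
    for owner, ev in wait_events:
        if owner == local_server:
            localized.append(ev)      # platform-local event: use directly
        else:
            proxy = UserEvent()       # signaled on remote completion
            proxies[(owner, id(ev))] = proxy
            localized.append(proxy)
    return localized

proxies = {}
ev_a, ev_b = UserEvent(), UserEvent()   # stand-ins for command events
waits = [("server_a", ev_a), ("server_b", ev_b)]
local_waits = localize_dependencies(waits, "server_a", proxies)
# ev_a is kept; ev_b is swapped for a fresh local proxy in `proxies`
```

When the remote server reports that its command finished, the runtime looks up the proxy in `proxies` and calls `set_complete()`, which releases any local commands waiting on it — so the dependency graph the application defined stays intact on every server.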