2018
DOI: 10.1007/s11265-018-1416-1
Exploiting Task Parallelism with OpenCL: A Case Study

Abstract: While data parallelism aspects of OpenCL have been of primary interest due to the massively data parallel GPUs being on focus, OpenCL also provides powerful capabilities to describe task parallelism. In this article we study the task parallel concepts available in OpenCL and find out how well the different vendor-specific implementations can exploit task parallelism when the parallelism is described in various ways utilizing the command queues. We show that the vendor implementations are not yet capable of ext…


Cited by 11 publications (7 citation statements)
References 8 publications
“…Then, the last step of the profiling phase, Step C, is performed. At this point, the acceleration between both offloading modes is computed, determining the best strategy to be used with the device being profiled: Thus, values higher than 1.0 indicate that the device has a beneficial behavior when facing workload splitting strategies, allowing an increase in throughputs by taking advantage of multiple command queues, overlap between computation and communication as well as appropriate interleaving between management and computation, as demonstrated in previous studies [17,19,[21][22][23][24]. And therefore, values lower than 1.0 indicate that it suffers penalization for device management and chunk synchronization, sharing of CPU usage with the simulator itself or other tasks and even an indication of very short execution times, where the generation of multiple chunks is usually counterproductive.…”
Section: Mash Algorithm
Confidence: 99%
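The profiling decision quoted above reduces to comparing the throughput of the two offloading modes and picking the one whose ratio favors it. A minimal sketch of that logic, with hypothetical function and variable names (the cited paper does not publish this code):

```python
# Illustrative sketch of the Step C decision described above: the
# acceleration ratio between the workload-splitting mode and the
# single-offload mode selects the strategy for the profiled device.
# All names here are assumptions for illustration.

def choose_offloading_strategy(throughput_split: float,
                               throughput_single: float):
    """Return ("split", ratio) when splitting across multiple command
    queues paid off during profiling, ("single", ratio) otherwise."""
    ratio = throughput_split / throughput_single
    if ratio > 1.0:
        # Splitting overlapped communication with computation and
        # interleaved management with computation effectively.
        return "split", ratio
    # Management, chunk synchronization, or very short execution times
    # made multiple chunks counterproductive.
    return "single", ratio

print(choose_offloading_strategy(120.0, 100.0))  # ratio 1.2 -> "split"
print(choose_offloading_strategy(80.0, 100.0))   # ratio 0.8 -> "single"
```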
“…There is no explicit concept of a task graph in the standard, but instead the runtime can construct the graphs implicitly: The devices are controlled using a concept of a command queue, which supports event synchronization between the kernels pushed to the devices through them. The concepts of a kernel (a function executed on a device), a command queue (a way to push work to a device), and an event (signaled by a completion of a command that can be waited on) together can be used to form complex parallel heterogeneous computing across a diverse heterogeneous platform with multiple different device types (Jääskeläinen et al, 2019). Hahnfeld et al (2018) show that to get significant speedups from executing heterogeneous task graphs, the devices must be able to interoperate efficiently to interleave the communication with the computation.…”
Section: Diverse Heterogeneous Platform Software Layer
Confidence: 99%
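The passage above describes how an OpenCL runtime builds a task graph implicitly: commands pushed to command queues carry event wait lists, and a command's completion signals its own event. A toy model of that mechanism (plain Python, not an OpenCL binding; all class names are illustrative):

```python
# Toy model of implicit task-graph construction from command queues and
# events, as described above: an enqueued "kernel" records the events it
# waits on, and completing it signals its own event so that dependents
# on any queue become ready. Not real OpenCL API.

from collections import deque

class Event:
    def __init__(self):
        self.complete = False

class CommandQueue:
    def __init__(self, runtime, name):
        self.runtime, self.name = runtime, name

    def enqueue(self, kernel, wait_events=()):
        ev = Event()
        self.runtime.commands.append((self, kernel, tuple(wait_events), ev))
        return ev

class Runtime:
    def __init__(self):
        self.commands = []
        self.log = []

    def flush(self):
        pending = deque(self.commands)
        while pending:
            queue, kernel, waits, ev = pending.popleft()
            if all(w.complete for w in waits):
                kernel()                     # "execute" the kernel
                ev.complete = True           # signal completion event
                self.log.append((queue.name, kernel.__name__))
            else:
                pending.append((queue, kernel, waits, ev))

rt = Runtime()
q_gpu, q_cpu = CommandQueue(rt, "gpu"), CommandQueue(rt, "cpu")
def produce(): pass
def consume(): pass
e = q_gpu.enqueue(produce)
q_cpu.enqueue(consume, wait_events=[e])  # cross-queue dependency via event
rt.flush()
print(rt.log)
```

The cross-queue wait is the key point: the runtime never sees an explicit graph, yet the event dependency forces `consume` on the CPU queue to run only after `produce` on the GPU queue has signaled.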
“…Event dependencies are mapped to platform-local events on each server and events for commands running on other servers are substituted with user events. This way the heterogeneous task graph based on event dependencies defined by the application stays intact on the remote servers and the runtime can apply optimisations utilizing the dependency rules outlined in [12].…”
Section: Decentralized Command Scheduling
Confidence: 99%
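The substitution described in this citation — replacing events for commands running on other servers with locally created user events — can be sketched as follows. The `UserEvent` class mirrors the role of OpenCL's `clCreateUserEvent`/`clSetUserEventStatus` pair; the function and dictionary names are assumptions, not the actual runtime's API:

```python
# Hedged sketch of localizing a distributed event wait list: events owned
# by the local server are used as-is, while events owned by other servers
# are substituted with user-event proxies that get signaled when the
# remote completion notification arrives. Names are illustrative.

class UserEvent:
    """Locally created event whose status the runtime sets explicitly,
    playing the role of clCreateUserEvent / clSetUserEventStatus."""
    def __init__(self):
        self.complete = False
    def set_complete(self):
        self.complete = True

def localize_dependencies(wait_events, local_server, proxies):
    """Replace events owned by other servers with local user-event
    proxies, keeping the application's task graph intact locally.
    `wait_events` is a list of (owner_server, event) pairs."""
    localized = []
    for owner, ev in wait_events:
        if owner == local_server:
            localized.append(ev)      # platform-local event: use directly
        else:
            proxy = UserEvent()       # signaled on remote completion
            proxies[(owner, id(ev))] = proxy
            localized.append(proxy)
    return localized

proxies = {}
ev_a, ev_b = UserEvent(), UserEvent()   # stand-ins for command events
waits = [("server_a", ev_a), ("server_b", ev_b)]
local_waits = localize_dependencies(waits, "server_a", proxies)
# ev_a is kept; ev_b is swapped for a fresh local proxy in `proxies`
```

When the remote server reports that its command finished, the runtime looks up the proxy in `proxies` and calls `set_complete()`, which releases any local commands waiting on it — so the dependency graph the application defined stays intact on every server.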