“…OpenUH [9] supports task dependencies using IDs that specify dependencies between tasks, removing the overhead of dynamically finding the dependencies, but expecting more from the programmer. Conceptually, our dep pattern clause converts a dynamic data-driven task graph to a graph such as OpenUH's.…”
Section: Related Work (mentioning)
confidence: 99%
“…While this might not be a problem in High-Performance Computing applications (where data sets are very large), managing parallelism from smaller workloads is very hard. Some run-time systems avoided the overhead cost by forcing the programmer to specify dependencies manually (see for instance OpenSTREAM [14] and OpenUH [9]), but at the cost of reduced user-friendliness. One major benefit of task-parallel programming is the promise of composability, i.e.…”
Citation for the original published paper: Podobas, A., Brorsson, M., Vlassov, V. (2014) TurboBŁYSK: Scheduling for improved data-driven task performance with fast dependency resolution. In: Using and Improving OpenMP for Devices, Tasks, and More: 10th International Workshop on OpenMP, IWOMP 2014, Salvador, Brazil, September 28-30, 2014. Proceedings (pp. 45-57). Abstract. Data-driven task-parallelism is attracting growing interest and has now also been added to OpenMP (4.0). This paradigm simplifies the writing of parallel applications, the extraction of parallelism, and the use of distributed memory architectures. While the programming model itself is becoming mature, a problem with current run-time scheduler implementations is that they require very large task granularity in order to scale, something which is at odds with the idea of task-parallel programming, where programmers should be able to concentrate on exposing parallelism with little regard to the task granularity. To mitigate this limitation, we have designed and implemented a highly efficient run-time scheduler of tasks with explicit data-dependence annotations: TurboBŁYSK. We propose a novel pattern-saving mechanism that allows the scheduler to re-use previously resolved dependency patterns, based on programmer annotations, enabling programs to use even the smallest of tasks to scale well. We experimentally show that our techniques in TurboBŁYSK can achieve nearly twice the peak performance compared with other runtime schedulers. Our techniques are not OpenMP-specific and could be implemented in other task-parallel frameworks.
“…cies between different tasks [104]. This is used in StarPU as well as other tasking models for heterogeneous systems including OmpSs and StarSs, as described below.…”
This thesis addresses issues associated with efficiently programming modern heterogeneous GPU-based systems, containing multicore CPUs and one or more programmable Graphics Processing Units (GPUs). We use ideas from component-based programming to address programming, performance and portability issues of these heterogeneous systems. Specifically, we present three approaches that all use the idea of having multiple implementations for each computation; performance is achieved/retained either (a) by selecting a suitable implementation for each computation on a given platform or (b) by dividing the computation work across different implementations running on CPU and GPU devices in parallel. In the first approach, we work on a skeleton programming library (SkePU) that provides high-level abstraction while making intelligent implementation selection decisions underneath, either before or during the actual program execution. In the second approach, we develop a composition tool that parses extra information (metadata) from XML files, makes certain decisions offline, and, in the end, generates code for making the final decisions at runtime. The third approach is a framework that uses source-code annotations and program analysis to generate code for the runtime library to make the selection decision at runtime. With a generic performance modeling API alongside program analysis capabilities, it supports online tuning as well as complex program transformations. These approaches differ in terms of genericity, intrusiveness, capabilities and knowledge about the program source code; however, they all demonstrate the usefulness of component programming techniques for programming GPU-based systems. With experimental evaluation, we demonstrate how all three approaches, although different in their own way, provide good performance on different GPU-based systems for a variety of applications. This work has been supported by two EU FP7 projects (PEPPHER, EXCESS) and by SeRC.
Popular science summary (translated from Swedish): Making each new generation of computers run faster is important for society's development and growth. Traditionally, most computers had only a single general-purpose processor (the so-called CPU), which could execute only one computation at a time. Over the last decade, however, multicore and manycore processors have become common, and computers have also become more heterogeneous. A modern computer system typically contains more than one CPU, together with special-purpose processors such as graphics processing units (GPUs) that are designed to execute certain types of computations more efficiently than CPUs. We call such a system with one or more GPUs a GPU-based system. GPUs in such systems have their own separate memory, and to run a computation on a GPU one usually needs to move all input data to the GPU's memory and then fetch the result back once the computation is done. Programming GPU-based systems is non-trivial for several reasons: (1) CPUs and GPUs require different programming exper...