2006
DOI: 10.1016/j.jpdc.2005.09.003
Optimizing locality and scalability of embedded Runge–Kutta solvers using block-based pipelining

Cited by 27 publications (45 citation statements) | References 34 publications
“…Next, we compare the SkePU overhead by comparing the execution of a real-world application with the SkePU framework to execution with its direct hand-coded implementations. Figure 3.8 compares SkePU execution of a Runge-Kutta ODE solver [138] with direct execution using a hand-written implementation for C++, OpenMP and CUDA. [Figure 3.8 caption: … [138] for execution using SkePU in comparison to execution using a hand-coded version for three different backends (CPU C++, OpenMP, CUDA) on System A.] The overhead of the SkePU implementations is negligible in all cases, even for such a large application containing 9 different types of skeleton calls executed over 1160 times in total.…”
Section: SkePU Overhead Analysis
confidence: 99%
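To make the comparison concrete: a skeleton call expresses the whole data-parallel operation in a single expression, while the hand-coded version spells out the loop, so the two should compile to essentially the same work. The following minimal C++ sketch contrasts the two styles with a hypothetical `map_skeleton` helper; it is not the actual SkePU API, and a real framework would additionally dispatch to OpenMP or CUDA backends at the call site.

```cpp
#include <cassert>
#include <vector>

// Hypothetical skeleton: applies a user function element-wise.
// A real framework such as SkePU would also select among
// CPU, OpenMP and CUDA backends here.
template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = f(in[i]);  // sequential CPU backend of this sketch
    return out;
}

int main() {
    std::vector<double> y = {1.0, 2.0, 3.0};

    // Skeleton call: one line, backend chosen by the framework.
    auto a = map_skeleton(y, [](double v) { return 2.0 * v; });

    // Hand-coded equivalent: the loop the skeleton expands to.
    std::vector<double> b(y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        b[i] = 2.0 * y[i];

    assert(a == b);  // both versions compute the same result
}
```

Because the skeleton body inlines to the same loop as the hand-written code, the per-call overhead can stay negligible even across hundreds of calls, which is consistent with the measurement quoted above.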
“…The optimal schedule for this experiment can be obtained by re-ordering both calls (see Figure 6.9a) and using HEFT afterwards. [Figure 6.12 caption: comparison of tool-generated performance-aware (TGPA) composition using bulk scheduling and selection heuristic with HEFT and direct (OpenMP, CUDA) execution for ODE solver component calls [138] on System A.] Finding a practical heuristic: although possibly sub-optimal, HEFT is still a practical solution for such a set of independent tasks.…”
Section: Farm Scheduling and Selection
confidence: 99%
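For independent tasks such as the ODE solver component calls above, HEFT's greedy rule reduces to ranking tasks by average cost and assigning each to the unit that finishes it earliest. A minimal C++ sketch of that rule follows; the task names and cost figures are invented for illustration, not taken from the cited experiment.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Task {
    const char* name;
    std::vector<double> cost;  // execution time on each unit, e.g. {CPU, GPU}
};

int main() {
    // Hypothetical per-unit costs; a real system would measure or model these.
    std::vector<Task> tasks = {{"rhs_eval", {8.0, 2.0}},
                               {"stage_mul", {4.0, 3.0}},
                               {"reduce_err", {1.0, 5.0}}};

    // HEFT ranking for independent tasks: schedule costlier tasks first.
    std::sort(tasks.begin(), tasks.end(), [](const Task& a, const Task& b) {
        return (a.cost[0] + a.cost[1]) > (b.cost[0] + b.cost[1]);
    });

    std::vector<double> ready(2, 0.0);  // time at which each unit becomes free
    for (const Task& t : tasks) {
        // Pick the unit with the earliest finish time for this task.
        int best = 0;
        double bestFinish = ready[0] + t.cost[0];
        for (int u = 1; u < (int)ready.size(); ++u) {
            double finish = ready[u] + t.cost[u];
            if (finish < bestFinish) { bestFinish = finish; best = u; }
        }
        ready[best] = bestFinish;
        std::printf("%s -> unit %d (finish %.1f)\n", t.name, best, bestFinish);
    }
}
```

Because the greedy rule fixes each placement without looking ahead, it can miss the optimum that re-ordering uncovers, which is exactly the trade-off the quoted passage describes.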
“…The scalability of such general implementations is therefore often not satisfactory. To overcome the limitations of general implementations, two possible approaches exploit special properties of either the embedded RK method [23] or the ODE system to be solved [29,31]. In the following, we pursue the second approach, taking advantage of special properties of the ODE system, and investigate the scalability of data-parallel implementations of embedded RK methods.…”
Section: Introduction
confidence: 99%
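As background for the quoted passage: an embedded RK method computes two approximations of different order per step and uses their difference as a local error estimate, and the data parallelism comes from the component-wise vector operations over the ODE system. The sketch below uses the simple Euler/Heun embedded pair on the test equation y' = -y; it illustrates the general scheme only, not the solver from the paper.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Right-hand side f(t, y) of the ODE system; here the test problem y' = -y.
Vec rhs(double /*t*/, const Vec& y) {
    Vec f(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) f[i] = -y[i];
    return f;
}

// One step of the embedded Euler(1)/Heun(2) pair. Returns the max-norm
// difference of the two approximations as the local error estimate.
double embedded_step(double t, double h, Vec& y) {
    Vec k1 = rhs(t, y);
    Vec y1(y.size());
    for (std::size_t i = 0; i < y.size(); ++i)      // data-parallel loop
        y1[i] = y[i] + h * k1[i];                   // 1st-order (Euler) result
    Vec k2 = rhs(t + h, y1);
    double err = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {    // data-parallel loop
        double y2 = y[i] + 0.5 * h * (k1[i] + k2[i]);  // 2nd-order (Heun)
        err = std::fmax(err, std::fabs(y2 - y1[i]));
        y[i] = y2;                                  // continue with higher order
    }
    return err;
}

int main() {
    Vec y = {1.0, 2.0};
    double err = embedded_step(0.0, 0.1, y);
    std::printf("y = (%.6f, %.6f), err = %.2e\n", y[0], y[1], err);
}
```

The per-component loops are what a data-parallel implementation distributes across processors; exploiting structure in `rhs` (e.g. limited access distance) is what the second approach in the quote builds on.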
“…In a shared-memory system, we only have to set a flag indicating that the data is ready for the next process; no data transfer needs to take place. The pipelining technique is described in [10]. Here we aim to automatically detect pipelining possibilities in the full task graph containing both the solver stages and the right-hand side of the system, and to automatically generate parallelized code optimized for the specific latency and bandwidth parameters of the target machine.…”
Section: Combining Parallelization at Several Levels
confidence: 99%
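The hand-over described in the quote can be sketched with a single atomic ready flag between two pipeline stages sharing one buffer: the producer writes a block and raises the flag, and the consumer synchronizes on the flag instead of copying data. A minimal shared-memory sketch follows; the block size and stage bodies are placeholders.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr std::size_t kBlock = 4;        // placeholder block size
std::vector<double> buffer(kBlock);      // shared between both stages
std::atomic<bool> ready{false};          // "data is ready" flag

void producer() {
    for (std::size_t i = 0; i < kBlock; ++i)
        buffer[i] = double(i) * 1.5;     // placeholder stage computation
    // Publish the block: a release store makes all prior writes visible.
    ready.store(true, std::memory_order_release);
}

void consumer() {
    // Wait for the flag instead of receiving a copy of the data.
    while (!ready.load(std::memory_order_acquire)) {}
    double sum = 0.0;
    for (double v : buffer) sum += v;    // operate on the shared block in place
    std::printf("sum = %.1f\n", sum);
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```

The release/acquire pair on the flag is the only synchronization cost; on a distributed-memory target the same hand-over would instead require an explicit message sized to the machine's latency and bandwidth parameters, which is what the automatic code generation in the quote aims to optimize.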