2006
DOI: 10.1016/j.jpdc.2005.09.003
Optimizing locality and scalability of embedded Runge–Kutta solvers using block-based pipelining

Cited by 27 publications (45 citation statements) | References 34 publications
“…Next, we compare the SkePU overhead by comparing the execution of a real-world application with the SkePU framework to execution with its direct hand-coded implementations. Figure 3.8 compares SkePU execution of a Runge-Kutta ODE solver [138] with direct execution using a hand-written implementation for C++, OpenMP and CUDA. [Figure 3.8 caption: … [138] for execution using SkePU in comparison to execution using a hand-coded version for three different backends (CPU C++, OpenMP, CUDA) on System A.] The overhead of the SkePU implementations is negligible in all cases, even for such a large application containing 9 different types of skeleton calls executed over 1160 times in total.…”
Section: SkePU Overhead Analysis
confidence: 99%
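To make the comparison concrete: a skeleton call expresses the whole data-parallel operation in a single expression, while the hand-coded version spells out the loop, so the two should compile to essentially the same work. The following minimal C++ sketch contrasts the two styles with a hypothetical `map_skeleton` helper; it is not the actual SkePU API, and a real framework would additionally dispatch to OpenMP or CUDA backends at the call site.

```cpp
#include <cassert>
#include <vector>

// Hypothetical skeleton: applies a user function element-wise.
// A real framework such as SkePU would also select among
// CPU, OpenMP and CUDA backends here.
template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = f(in[i]);  // sequential CPU backend of this sketch
    return out;
}

int main() {
    std::vector<double> y = {1.0, 2.0, 3.0};

    // Skeleton call: one line, backend chosen by the framework.
    auto a = map_skeleton(y, [](double v) { return 2.0 * v; });

    // Hand-coded equivalent: the loop the skeleton expands to.
    std::vector<double> b(y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        b[i] = 2.0 * y[i];

    assert(a == b);  // both versions compute the same result
}
```

Because the skeleton body inlines to the same loop as the hand-written code, the per-call overhead can stay negligible even across hundreds of calls, which is consistent with the measurement quoted above.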
“…The optimal schedule for this experiment can be obtained by re-ordering both calls (see Figure 6.9a) and using HEFT afterwards. [Figure 6.12 caption: comparison of tool-generated performance-aware (TGPA) composition using bulk scheduling and selection heuristic with HEFT and direct (OpenMP, CUDA) execution for ODE solver component calls [138] on System A.] Finding a practical heuristic: although possibly sub-optimal, HEFT is still a practical solution for such a set of independent tasks.…”
Section: Farm Scheduling and Selection
confidence: 99%
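For independent tasks such as the ODE solver component calls above, HEFT's greedy rule reduces to ranking tasks by average cost and assigning each to the unit that finishes it earliest. A minimal C++ sketch of that rule follows; the task names and cost figures are invented for illustration, not taken from the cited experiment.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Task {
    const char* name;
    std::vector<double> cost;  // execution time on each unit, e.g. {CPU, GPU}
};

int main() {
    // Hypothetical per-unit costs; a real system would measure or model these.
    std::vector<Task> tasks = {{"rhs_eval", {8.0, 2.0}},
                               {"stage_mul", {4.0, 3.0}},
                               {"reduce_err", {1.0, 5.0}}};

    // HEFT ranking for independent tasks: schedule costlier tasks first.
    std::sort(tasks.begin(), tasks.end(), [](const Task& a, const Task& b) {
        return (a.cost[0] + a.cost[1]) > (b.cost[0] + b.cost[1]);
    });

    std::vector<double> ready(2, 0.0);  // time at which each unit becomes free
    for (const Task& t : tasks) {
        // Pick the unit with the earliest finish time for this task.
        int best = 0;
        double bestFinish = ready[0] + t.cost[0];
        for (int u = 1; u < (int)ready.size(); ++u) {
            double finish = ready[u] + t.cost[u];
            if (finish < bestFinish) { bestFinish = finish; best = u; }
        }
        ready[best] = bestFinish;
        std::printf("%s -> unit %d (finish %.1f)\n", t.name, best, bestFinish);
    }
}
```

Because the greedy rule fixes each placement without looking ahead, it can miss the optimum that re-ordering uncovers, which is exactly the trade-off the quoted passage describes.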
“…The scalability of such general implementations is therefore often not satisfactory. To overcome the limitations of general implementations, two possible approaches exploit special properties of either the embedded RK method [23] or the ODE system to be solved [29,31]. In the following, we pursue the second approach, taking advantage of special properties of the ODE system, and investigate the scalability of data-parallel implementations of embedded RK methods.…”
Section: Introduction
confidence: 99%
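As background for the quoted passage: an embedded RK method computes two approximations of different order per step and uses their difference as a local error estimate, and the data parallelism comes from the component-wise vector operations over the ODE system. The sketch below uses the simple Euler/Heun embedded pair on the test equation y' = -y; it illustrates the general scheme only, not the solver from the paper.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Right-hand side f(t, y) of the ODE system; here the test problem y' = -y.
Vec rhs(double /*t*/, const Vec& y) {
    Vec f(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) f[i] = -y[i];
    return f;
}

// One step of the embedded Euler(1)/Heun(2) pair. Returns the max-norm
// difference of the two approximations as the local error estimate.
double embedded_step(double t, double h, Vec& y) {
    Vec k1 = rhs(t, y);
    Vec y1(y.size());
    for (std::size_t i = 0; i < y.size(); ++i)      // data-parallel loop
        y1[i] = y[i] + h * k1[i];                   // 1st-order (Euler) result
    Vec k2 = rhs(t + h, y1);
    double err = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {    // data-parallel loop
        double y2 = y[i] + 0.5 * h * (k1[i] + k2[i]);  // 2nd-order (Heun)
        err = std::fmax(err, std::fabs(y2 - y1[i]));
        y[i] = y2;                                  // continue with higher order
    }
    return err;
}

int main() {
    Vec y = {1.0, 2.0};
    double err = embedded_step(0.0, 0.1, y);
    std::printf("y = (%.6f, %.6f), err = %.2e\n", y[0], y[1], err);
}
```

The per-component loops are what a data-parallel implementation distributes across processors; exploiting structure in `rhs` (e.g. limited access distance) is what the second approach in the quote builds on.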
“…In a shared-memory system, we only have to set a flag indicating that the data is ready for the next process; no data transfer needs to take place. The pipelining technique is described in [10]. Here we aim to automatically detect pipelining possibilities in the full task graph containing both the solver stages and the right-hand side of the system, and to automatically generate parallelized code optimized for the specific latency and bandwidth parameters of the target machine.…”
Section: Combining Parallelization at Several Levels
confidence: 99%
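The hand-over described in the quote can be sketched with a single atomic ready flag between two pipeline stages sharing one buffer: the producer writes a block and raises the flag, and the consumer synchronizes on the flag instead of copying data. A minimal shared-memory sketch follows; the block size and stage bodies are placeholders.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr std::size_t kBlock = 4;        // placeholder block size
std::vector<double> buffer(kBlock);      // shared between both stages
std::atomic<bool> ready{false};          // "data is ready" flag

void producer() {
    for (std::size_t i = 0; i < kBlock; ++i)
        buffer[i] = double(i) * 1.5;     // placeholder stage computation
    // Publish the block: a release store makes all prior writes visible.
    ready.store(true, std::memory_order_release);
}

void consumer() {
    // Wait for the flag instead of receiving a copy of the data.
    while (!ready.load(std::memory_order_acquire)) {}
    double sum = 0.0;
    for (double v : buffer) sum += v;    // operate on the shared block in place
    std::printf("sum = %.1f\n", sum);
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```

The release/acquire pair on the flag is the only synchronization cost; on a distributed-memory target the same hand-over would instead require an explicit message sized to the machine's latency and bandwidth parameters, which is what the automatic code generation in the quote aims to optimize.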