On Improving the Performance of Multi-threaded CUDA Applications with Concurrent Kernel Execution by Kernel Reordering

Wende, Florian; Cordes, Frank; Steinke, Thomas

doi:10.1109/saahpc.2012.12

Cited by 34 publications

(21 citation statements)

References 18 publications

“…Wende et al [8] demonstrate a CPU-GPU parallelization scheme on the GLAT molecular thermodynamics code. Similar to our work, their approach extracts parallelism from different loops for execution on CPU and GPU cores.…”

Section: Related Work (mentioning)

confidence: 99%

Nested MIMD-SIMD Parallelization for Heterogeneous Microprocessors

Gerzhoy

Sun

Zuzak

et al. 2019

ACM Trans. Archit. Code Optim.

Heterogeneous microprocessors integrate a CPU and GPU on the same chip, providing fast CPU-GPU communication and enabling cores to compute on data "in place." This permits exploiting a finer granularity of parallelism on the integrated GPUs, and enables the use of GPUs for accelerating more complex and irregular codes. One challenge, however, is exposing enough parallelism such that both the CPU and GPU are effectively utilized to achieve maximum gain. In this article, we propose exploiting nested parallelism for integrated CPU-GPU chips. We look for loop structures in which one or more regular data parallel loops are nested within a parallel outer loop that can contain irregular code (e.g., with control divergence). By scheduling the outer loop on multiple CPU cores, multiple dynamic instances of the inner regular loop(s) can be scheduled on the GPU cores. This boosts GPU utilization and parallelizes the outer loop. We find that such nested MIMD-SIMD parallelization provides greater levels of parallelism for integrated CPU-GPU chips, and additionally there is ample opportunity to perform such parallelization in OpenMP programs. Our results show nested MIMD-SIMD parallelization provides a 16.1x and 8.67x speedup over sequential execution on a simulator and a physical machine, respectively. Our technique beats CPU-only parallelization by 4.13x and 2.40x, respectively, and GPU-only parallelization by 2.74x and 2.26x, respectively. Compared to the next-best scheme (either CPU- or GPU-only parallelization) per benchmark, our approach provides a 1.46x and 1.23x speedup for the simulator and physical machine, respectively.

Section: Related Work (mentioning)

confidence: 99%

“…Some researchers have already begun examining this question. In particular, Wende et al [8] demonstrate a CPU-GPU parallelization scheme for a molecular thermodynamics code, called GLAT. The authors observe that the GLAT code processes two different types of molecules, all of which can be performed in parallel.…”

Section: Introduction (mentioning)

confidence: 99%

Nested MIMD-SIMD Parallelization for Heterogeneous Microprocessors

Gerzhoy

Sun

Zuzak

et al. 2019

ACM Trans. Archit. Code Optim.

“…, m. The question remains, how to sufficiently cover the relevant part of X with an initial discretization. One could apply the existing methods for a good initial sampling of X (ConCoord [16], GLAT [17], taboo search [18], or continuation methods [19]). Alternatively, the above picking algorithm could be used (and will be used in the numerical example) to "fill" X: After we have constructed the basis functions Φ_k, we perform the restraint simulations according to the penalty potentials U_k.…”

Section: Constructing An Initial Discretization (mentioning)

confidence: 99%

Set-free Markov state model building

Weber

Fackeldey

Schütte

2017

The Journal of Chemical Physics

Molecular dynamics (MD) simulations face challenging problems since the time scales of interest often are much longer than what is possible to simulate; and even if sufficiently long simulations are possible the complex nature of the resulting simulation data makes interpretation difficult. Markov State Models (MSMs) help to overcome these problems by making experimentally relevant time scales accessible via coarse grained representations that also allow for convenient interpretation. However, standard set-based MSMs exhibit some caveats limiting their approximation quality and statistical significance. One of the main caveats results from the fact that typical MD trajectories repeatedly re-cross the boundary between the sets used to build the MSM which causes statistical bias in estimating the transition probabilities between these sets. In this article, we present a set-free approach to MSM building utilizing smooth overlapping ansatz functions instead of sets and an adaptive refinement approach. This kind of meshless discretization helps to overcome the recrossing problem and yields an adaptive refinement procedure that allows us to improve the quality of the model while exploring state space and inserting new ansatz functions into the MSM.
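For orientation, the standard set-based estimate that this abstract contrasts against can be sketched as follows. This is a minimal illustration of counting transitions between discrete sets and row-normalizing, not the authors' set-free method; the function name `estimate_transition_matrix` and the integer-labelled trajectory are assumptions made for the example.

```python
def estimate_transition_matrix(states, n_states, lag=1):
    """Estimate set-to-set transition probabilities from a discrete trajectory.

    states   -- sequence of set indices (0 .. n_states-1) visited by the trajectory
    n_states -- number of sets in the discretization
    lag      -- lag time in trajectory steps (must be >= 1)
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")

    # Count observed transitions i -> j at the chosen lag time.
    counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(states[:-lag], states[lag:]):
        counts[i][j] += 1

    # Row-normalize the counts; rows with no observations stay at zero.
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

Because every boundary crossing in `states` is counted as a transition, a trajectory that repeatedly re-crosses a set boundary inflates the off-diagonal counts, which is the bias the abstract refers to.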

“…Wende et al proposes a reordering scheme of kernel invocations [14]. As opposed to our scheme, they target concurrent execution of small-scale multiple kernels on a single device.…”

Section: Related Work (mentioning)

confidence: 99%

Dynamic Task Scheduling Scheme for a GPGPU Programming Framework

Ohno

Yamamoto

Tanaka

2016

IJNC

The computational power and the physical memory size of a single GPU device are often insufficient for large-scale problems. Using CUDA, the user must explicitly partition such problems into several tasks repeating the data transfers and kernel executions. To use multiple GPUs, explicit device switching is also needed. Furthermore, low-level hand optimizations such as load balancing and determining task granularity are required to achieve high performance. To handle large-scale problems without any additional user code, we introduce an implicit dynamic task scheduling scheme to our CUDA variation MESI-CUDA. MESI-CUDA is designed to abstract the low-level GPU features; virtual shared variables and logical thread mappings hide the complex memory hierarchy and physical characteristics. On the other hand, explicit parallel execution using kernel functions is the same as in CUDA. In our scheme, each kernel invocation in the user code is translated into a job submission to the runtime scheduler. The scheduler partitions a job into tasks considering the device memory size and dynamically schedules them to the available GPU devices. Thus the user can simply specify kernel invocations independent of the execution environment. The evaluation result shows that our scheme can automatically utilize heterogeneous GPU devices with small overhead.
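The idea of partitioning a job by device memory size and handing tasks to whichever device is free can be sketched roughly as below. This is not MESI-CUDA's scheduler: the function names `partition_job` and `schedule`, the per-element byte size, and the device speed figures are all hypothetical, and a simple greedy rule stands in for the real runtime.

```python
from collections import deque


def partition_job(n_elements, bytes_per_element, device_mem_bytes):
    """Split a job into (start, stop) index ranges that each fit in device memory."""
    if bytes_per_element <= 0 or device_mem_bytes < bytes_per_element:
        raise ValueError("device memory must hold at least one element")
    chunk = device_mem_bytes // bytes_per_element  # elements per task
    return [(start, min(start + chunk, n_elements))
            for start in range(0, n_elements, chunk)]


def schedule(tasks, device_speeds):
    """Greedy dynamic scheduling: the device that becomes idle first takes the next task.

    device_speeds maps a (hypothetical) device name to elements processed per time unit.
    Returns a dict mapping each device to the list of tasks it received.
    """
    queue = deque(tasks)
    busy_until = {dev: 0.0 for dev in device_speeds}
    assignment = {dev: [] for dev in device_speeds}
    while queue:
        start, stop = queue.popleft()
        dev = min(busy_until, key=busy_until.get)  # earliest-idle device
        busy_until[dev] += (stop - start) / device_speeds[dev]
        assignment[dev].append((start, stop))
    return assignment
```

For instance, `partition_job(10, 4, 16)` yields three tasks, and with one device four times faster than the other the faster device ends up with two of them, which is the kind of load balancing across heterogeneous devices the abstract describes.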
