2012 Symposium on Application Accelerators in High Performance Computing 2012
DOI: 10.1109/saahpc.2012.12

On Improving the Performance of Multi-threaded CUDA Applications with Concurrent Kernel Execution by Kernel Reordering

Abstract: General-purpose graphics processing units (GPUs) have been found to be viable solutions for large-scale numerical computations with an inherent potential for massive parallelism. In contrast, little is known about using GPUs for small-scale computations. To keep the GPU from being under-utilized for small problem sizes, a meaningful approach is to perform as many small-scale computations as possible in a concurrent manner. On NVIDIA Fermi GPUs, the concept of Concurrent Kernel Execution (CKE) allows for the exe…

Citations: cited by 34 publications (21 citation statements)
References: 18 publications
“…Wende et al [8] demonstrate a CPU-GPU parallelization scheme on the GLAT molecular thermodynamics code. Similar to our work, their approach extracts parallelism from different loops for execution on CPU and GPU cores.…”
Section: Related Work (mentioning)
confidence: 99%
“…Some researchers have already begun examining this question. In particular, Wende et al [8] demonstrate a CPU-GPU parallelization scheme for a molecular thermodynamics code, called GLAT. The authors observe that the GLAT code processes two different types of molecules, all of which can be performed in parallel.…”
Section: Introduction (mentioning)
confidence: 99%
“…, m. The question remains, how to sufficiently cover the relevant part of X with an initial discretization. One could apply the existing methods for a good initial sampling of X (ConCoord [16], GLAT [17], taboo search [18], or continuation methods [19]). Alternatively, the above picking algorithm could be used (and will be used in the numerical example) to "fill" X: After we have constructed the basis functions Φ_k, we perform the restraint simulations according to the penalty potentials U_k.…”
Section: Constructing An Initial Discretization (mentioning)
confidence: 99%
“…Wende et al proposes a reordering scheme of kernel invocations [14]. As opposed to our scheme, they target concurrent execution of small-scale multiple kernels on a single device.…”
Section: Related Work (mentioning)
confidence: 99%