2015
DOI: 10.1007/s11390-015-1500-y

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Abstract: Parallel programs consist of a series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest Nvidia Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynami…
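To make the pattern concrete, the sketch below shows the situation the abstract describes: each parent thread does some sequential per-row work and then reaches a nested parallel loop, which dynamic parallelism lets it hand to a device-launched child kernel. This is a minimal illustration, not code from the paper; the kernel names (parent, child_loop), sizes, and doubling workload are invented, and it assumes a device of compute capability 3.5 or newer compiled with -rdc=true.

```cuda
// A minimal sketch of the pattern the abstract describes, not code from the
// paper: each parent thread does sequential per-row setup, then uses a
// device-side (dynamic parallelism) launch to run its nested parallel loop.
// Requires compute capability 3.5+ and compilation with -rdc=true.
#include <cuda_runtime.h>

__global__ void child_loop(float* row, int cols) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < cols) row[j] *= 2.0f;                 // body of the nested parallel loop
}

__global__ void parent(float* data, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;
    float* row = data + (size_t)i * cols;         // sequential per-thread work
    // Device-side launch: the running thread starts a child grid for its loop.
    child_loop<<<(cols + 127) / 128, 128>>>(row, cols);
}

int main() {
    const int rows = 1024, cols = 256;
    float* d = nullptr;
    cudaMalloc(&d, (size_t)rows * cols * sizeof(float));
    parent<<<(rows + 127) / 128, 128>>>(d, rows, cols);
    cudaDeviceSynchronize();                      // wait for parent and all child grids
    cudaFree(d);
    return 0;
}
```

As the citing works quoted below point out, each such device-side launch still carries a substantial overhead, which motivates the alternatives they discuss.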

Cited by 29 publications (45 citation statements)
References 43 publications
“…EffiSha gives no explicit treatment to dynamic parallelism, for two reasons. First, dynamic parallelism on GPU is rarely used in practice due to its large overhead (as much as 60X) [10,11]. Second, a recent work [10] shows that dynamic parallelism in a kernel can be automatically replaced with thread reuse through free launch transformations, and yields large speedups.…”
Section: Memory Size and Dynamic Parallelism (mentioning)
confidence: 99%
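The thread-reuse idea mentioned in this quote can be sketched as follows. This is not the actual free-launch transformation from [10]; it is a hedged, hand-written illustration in which the nested loop from the earlier dynamic-parallelism sketch is executed by the parent threads already in flight instead of by a child grid (parent_reuse is an invented name).

```cuda
// A hedged sketch of the thread-reuse idea, not the actual free-launch
// transformation from [10]: the child launch from the previous sketch is
// removed, and the parent threads already in flight execute the nested loop
// cooperatively, avoiding the device-side launch overhead entirely.
__global__ void parent_reuse(float* data, int rows, int cols) {
    // Each block handles rows in a grid-stride fashion...
    for (int i = blockIdx.x; i < rows; i += gridDim.x) {
        float* row = data + (size_t)i * cols;     // sequential per-row setup
        // ...and the block's threads share the inner loop instead of
        // spawning a child grid.
        for (int j = threadIdx.x; j < cols; j += blockDim.x) {
            row[j] *= 2.0f;
        }
    }
}
```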
“…CUDA's dynamic parallelism (or OpenCL's device-side enqueue) lets threads already in flight create new groups of threads [36]. This feature gives developers the opportunity to implement strikingly elegant algorithms [21].…”
Section: Introduction (mentioning)
confidence: 99%
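One example of the "strikingly elegant" formulations this enables is a kernel that recurses by launching itself. The sketch below is a made-up illustration, not an algorithm from the cited works: the names scale_range and THRESHOLD are invented, and it ignores the nesting-depth limit on device-side launches.

```cuda
// A made-up illustration of the recursive style dynamic parallelism enables:
// the kernel subdivides its range by launching itself, so the recursion of
// the algorithm maps directly onto device-side launches.
#define THRESHOLD 1024

__global__ void scale_range(float* a, int lo, int hi) {
    int n = hi - lo;
    if (n <= THRESHOLD) {
        // Base case: this block processes the small range directly.
        for (int i = lo + threadIdx.x; i < hi; i += blockDim.x)
            a[i] *= 2.0f;
    } else if (threadIdx.x == 0) {
        // Recursive case: one thread splits the range into two child launches.
        int mid = lo + n / 2;
        scale_range<<<1, 128>>>(a, lo, mid);
        scale_range<<<1, 128>>>(a, mid, hi);
    }
}
```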
“…For example, [1] shows that dynamic parallelism can improve the performance of some clustering algorithms. [2] proposes a solution for parallel code that occurs inside a thread, in the hope that this method can be used instead of CUDA's dynamic parallelism.…”
Section: Introduction (mentioning)
confidence: 99%
“…The first is that parent and child kernels must communicate through global memory; the second is that launching a kernel from the GPU side incurs a very large overhead [2]. The official whitepaper mostly describes what a GPU's computing power can achieve in real-world terms, rather than how its components work together.…”
Section: Introduction (mentioning)
confidence: 99%
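The first limitation in that quote, parent-child communication going through global memory only, can be illustrated with a short sketch. This is an assumption-laden example, not code from any cited paper: the names (parent_comm, child_fill, global_buf) are invented, and the commented-out launch marks the kind of shared-memory pointer that cannot be handed to a child kernel.

```cuda
// An assumption-laden sketch of the first point in the quote: a child kernel
// can only be handed pointers to global memory, so parent and child exchange
// data through a global buffer. Assumes parent_comm runs with <=128 threads
// per block; all names are invented for illustration.
__global__ void child_fill(float* out, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) out[j] = 0.5f * j;                 // child writes results to global memory
}

__global__ void parent_comm(float* global_buf, int n) {
    __shared__ float tile[128];
    tile[threadIdx.x] = 0.0f;                     // per-block staging, invisible to children
    if (threadIdx.x == 0) {
        // child_fill<<<1, 128>>>(tile, 128);     // illegal: shared-memory pointer passed to a child
        child_fill<<<(n + 127) / 128, 128>>>(global_buf, n);  // legal: global memory only
    }
    // The child's writes to global_buf are guaranteed visible once the parent
    // grid and all of its child grids have completed; each such device-side
    // launch also carries the overhead the quote refers to.
}
```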