2015
DOI: 10.1007/s11390-015-1500-y

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Abstract: Parallel programs consist of a series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest Nvidia Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynami…
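To make the pattern concrete, the sketch below shows the situation the abstract describes: each parent thread does some sequential per-row work and then reaches a nested parallel loop, which dynamic parallelism lets it hand to a device-launched child kernel. This is a minimal illustration, not code from the paper; the kernel names (parent, child_loop), sizes, and doubling workload are invented, and it assumes a device of compute capability 3.5 or newer compiled with -rdc=true.

```cuda
// A minimal sketch of the pattern the abstract describes, not code from the
// paper: each parent thread does sequential per-row setup, then uses a
// device-side (dynamic parallelism) launch to run its nested parallel loop.
// Requires compute capability 3.5+ and compilation with -rdc=true.
#include <cuda_runtime.h>

__global__ void child_loop(float* row, int cols) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < cols) row[j] *= 2.0f;                 // body of the nested parallel loop
}

__global__ void parent(float* data, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;
    float* row = data + (size_t)i * cols;         // sequential per-thread work
    // Device-side launch: the running thread starts a child grid for its loop.
    child_loop<<<(cols + 127) / 128, 128>>>(row, cols);
}

int main() {
    const int rows = 1024, cols = 256;
    float* d = nullptr;
    cudaMalloc(&d, (size_t)rows * cols * sizeof(float));
    parent<<<(rows + 127) / 128, 128>>>(d, rows, cols);
    cudaDeviceSynchronize();                      // wait for parent and all child grids
    cudaFree(d);
    return 0;
}
```

As the citing works quoted below point out, each such device-side launch still carries a substantial overhead, which motivates the alternatives they discuss.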

Cited by 29 publications (45 citation statements)
References 43 publications
“…EffiSha gives no explicit treatment to dynamic parallelism, for two reasons. First, dynamic parallelism on GPU is rarely used in practice due to its large overhead (as much as 60X) [10,11]. Second, a recent work [10] shows that dynamic parallelism in a kernel can be automatically replaced with thread reuse through free launch transformations, and yields large speedups.…”
Section: Memory Size and Dynamic Parallelism (mentioning)
confidence: 99%
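The thread-reuse idea mentioned in this quote can be sketched as follows. This is not the actual free-launch transformation from [10]; it is a hedged, hand-written illustration in which the nested loop from the earlier dynamic-parallelism sketch is executed by the parent threads already in flight instead of by a child grid (parent_reuse is an invented name).

```cuda
// A hedged sketch of the thread-reuse idea, not the actual free-launch
// transformation from [10]: the child launch from the previous sketch is
// removed, and the parent threads already in flight execute the nested loop
// cooperatively, avoiding the device-side launch overhead entirely.
__global__ void parent_reuse(float* data, int rows, int cols) {
    // Each block handles rows in a grid-stride fashion...
    for (int i = blockIdx.x; i < rows; i += gridDim.x) {
        float* row = data + (size_t)i * cols;     // sequential per-row setup
        // ...and the block's threads share the inner loop instead of
        // spawning a child grid.
        for (int j = threadIdx.x; j < cols; j += blockDim.x) {
            row[j] *= 2.0f;
        }
    }
}
```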
“…CUDA's dynamic parallelism (or OpenCL's device-side enqueue) lets threads already in flight create new groups of threads [36]. This feature gives developers the opportunity to implement strikingly elegant algorithms [21].…”
Section: Introduction (mentioning)
confidence: 99%
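One example of the "strikingly elegant" formulations this enables is a kernel that recurses by launching itself. The sketch below is a made-up illustration, not an algorithm from the cited works: the names scale_range and THRESHOLD are invented, and it ignores the nesting-depth limit on device-side launches.

```cuda
// A made-up illustration of the recursive style dynamic parallelism enables:
// the kernel subdivides its range by launching itself, so the recursion of
// the algorithm maps directly onto device-side launches.
#define THRESHOLD 1024

__global__ void scale_range(float* a, int lo, int hi) {
    int n = hi - lo;
    if (n <= THRESHOLD) {
        // Base case: this block processes the small range directly.
        for (int i = lo + threadIdx.x; i < hi; i += blockDim.x)
            a[i] *= 2.0f;
    } else if (threadIdx.x == 0) {
        // Recursive case: one thread splits the range into two child launches.
        int mid = lo + n / 2;
        scale_range<<<1, 128>>>(a, lo, mid);
        scale_range<<<1, 128>>>(a, mid, hi);
    }
}
```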
“…For example, [1] shows that dynamic parallelism can improve the performance of some clustering algorithms. [2] proposes a solution for parallel code that occurs inside a thread, in the hope that this method can be used instead of CUDA's dynamic parallelism.…”
Section: Introduction (mentioning)
confidence: 99%
“…The first is that parent and child kernels must communicate through global memory; the second is that launching a kernel from the GPU side incurs a very large overhead [2]. The official whitepaper mostly describes what a GPU's computing power can achieve in real-world terms, rather than how its components work together.…”
Section: Introduction (mentioning)
confidence: 99%
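The first limitation in that quote, parent-child communication going through global memory only, can be illustrated with a short sketch. This is an assumption-laden example, not code from any cited paper: the names (parent_comm, child_fill, global_buf) are invented, and the commented-out launch marks the kind of shared-memory pointer that cannot be handed to a child kernel.

```cuda
// An assumption-laden sketch of the first point in the quote: a child kernel
// can only be handed pointers to global memory, so parent and child exchange
// data through a global buffer. Assumes parent_comm runs with <=128 threads
// per block; all names are invented for illustration.
__global__ void child_fill(float* out, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) out[j] = 0.5f * j;                 // child writes results to global memory
}

__global__ void parent_comm(float* global_buf, int n) {
    __shared__ float tile[128];
    tile[threadIdx.x] = 0.0f;                     // per-block staging, invisible to children
    if (threadIdx.x == 0) {
        // child_fill<<<1, 128>>>(tile, 128);     // illegal: shared-memory pointer passed to a child
        child_fill<<<(n + 127) / 128, 128>>>(global_buf, n);  // legal: global memory only
    }
    // The child's writes to global_buf are guaranteed visible once the parent
    // grid and all of its child grids have completed; each such device-side
    // launch also carries the overhead the quote refers to.
}
```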