Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations

Wu, Bo; Chen, Guoyang; Liu, Dong; Shen, Xipeng; Vetter, Jeffrey S.

doi:10.1145/2751205.2751213

Cited by 80 publications

(24 citation statements)

References 51 publications

“…Our work explicitly finds occupancy bounds across a range of GPUs in Section 5.1. Many instances of previous work (which use the persistent thread model) also acknowledge this bound [5,6,11,13,19,21,25,30,31,32], adding to the empirical evidence for this execution model.…”

Section: Occupancy-bound Execution Model (mentioning)

confidence: 98%

Portable inter-workgroup barrier synchronisation for GPUs

Sorensen, Donaldson, Batty, et al. 2016

Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applicatio

“…In the literature, a large collection of program transformations to improve performance of GPU applications is available, see e.g. [2,13,17,20]. We argue that such transformations should be specified formally, and whenever the transformation is applied to the code, the corresponding specifications also should be transformed, in such a way that the resulting program can be verified again (provided the original program could be verified).…”

Section: Correctness and Compiler Optimisations (mentioning)

confidence: 99%

Program Correctness by Transformation

Huisman, Blom, Darabi, et al. 2018

Leveraging Applications of Formal Methods, Verification and Validation. Modeling

Deductive program verification can be used effectively to verify high-level programs, but can be challenging for low-level, high-performance code. In this paper, we argue that compilation and program transformations should be made annotation-aware, i.e. during compilation and program transformation, not only the code should be changed, but also the corresponding annotations. As a result, if the original high-level program could be verified, also the resulting low-level program can be verified. We illustrate this approach on a concrete case, where loop annotations that capture possible loop parallelisations are translated into specifications of an OpenCL kernel that corresponds to the parallel loop. We also sketch how several commonly used OpenCL kernel transformations can be adapted to also transform the corresponding program annotations. Finally, we conclude the paper with a list of research challenges that need to be addressed to further develop this approach.

“…This allows GPU resources to be simultaneously shared among kernels. However, if this feature is not effectively employed, some resources might remain underutilized while a kernel is running [28]. Therefore, it is beneficial to allocate these unused resources to other kernels.…”

Section: GPU Architecture and Programming Models (mentioning)

confidence: 99%

Metric Selection for GPU Kernel Classification

Shekofteh, Noori, Naghibzadeh, et al. 2018

ACM Trans. Archit. Code Optim.

Graphics Processing Units (GPUs) are vastly used for running massively parallel programs. GPU kernels exhibit different behavior at runtime and can usually be classified in a simple form as either "compute-bound" or "memory-bound." Recent GPUs are capable of concurrently running multiple kernels, which raises the question of how to most appropriately schedule kernels to achieve higher performance. In particular, coscheduling of compute-bound and memory-bound kernels seems promising. However, its benefits as well as drawbacks must be determined along with which kernels should be selected for a concurrent execution. Classifying kernels can be performed online by instrumentation based on performance counters. This work conducts a thorough analysis of the metrics collected from various benchmarks from Rodinia and CUDA SDK. The goal is to find the minimum number of effective metrics that enables online classification of kernels with a low overhead. This study employs a wrapper-based feature selection method based on the Fisher feature selection criterion. The results of experiments show that to classify kernels with a high accuracy, only three and five metrics are sufficient on a Kepler and a Pascal GPU, respectively. The proposed method is then utilized for a runtime scheduler. The results show an average speedup of 1.18× and 1.1× compared with a serial and a random scheduler, respectively.
