Proceedings of the 23rd International Conference on Parallel Architectures and Compilation 2014
DOI: 10.1145/2628071.2628101
Warp-aware trace scheduling for GPUs

Abstract: GPU performance depends not only on thread/warp-level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks; it is also necessary to exploit opportunities for ILP optimization beyond branch boundaries. Unfortunately, modern GPUs cannot dynamically carry out such optimizations because they lack hardware branch prediction and cannot speculatively execute instructions beyond a branch. We propose to circumvent these limitations by adapting Trace Scheduling […]
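To make the limitation concrete, the sketch below shows the kind of rewrite a trace-scheduling pass would automate; the kernels and names are hypothetical illustrations, not code from the paper. In the first kernel the load of b[i] sits beyond a branch, so neither a basic-block scheduler nor the hardware (which has no branch prediction) can start it early; the second kernel speculatively hoists the load above the branch so its latency overlaps the comparison.

```cuda
// Hypothetical example: ILP is limited because the load of b[i]
// cannot be issued until the branch on a[i] has been resolved.
__global__ void scale_if(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (a[i] > 0.0f) {
        out[i] = a[i] * b[i];      // long-latency load of b[i] starts only here
    } else {
        out[i] = a[i];
    }
}

// Trace-scheduling-style rewrite (done by hand here): the load is issued
// speculatively before the branch, which is safe because b[i] is in bounds
// whenever the thread is active; an unused result is simply discarded.
__global__ void scale_if_hoisted(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float av = a[i];
    float bv = b[i];               // speculated load overlaps the comparison
    out[i] = (av > 0.0f) ? av * bv : av;
}
```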

Cited by 18 publications (9 citation statements, 2015–2023) · References 27 publications
“…Thus TLP and ILP are in opposition, and attaining full utilization requires carefully balancing both techniques. While TLP is commonly used across all of GPU computing, ILP is a less explored area, with prior work limited to dense linear algebra [22] and microcode optimization [23].…”
Section: Latency Hiding with TLP and ILP
confidence: 99%
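The ILP technique this snippet alludes to (explored for dense linear algebra in its reference [22]) can be sketched as follows; the kernel is an assumed CUDA example, not taken from the cited works. Each thread keeps several independent accumulators, so several loads and adds are in flight per thread and memory latency is hidden even when occupancy, i.e. TLP, is reduced.

```cuda
// Assumed illustration of per-thread ILP: four independent dependence
// chains per thread instead of one.
__global__ void sum4(const float* x, float* partial, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x * 4;
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;   // independent accumulators
    for (int i = tid * 4; i + 3 < n; i += stride) { // tail handling omitted for brevity
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    partial[tid] = s0 + s1 + s2 + s3;               // one partial sum per thread
}
```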
“…A cl_context must be created, which specifies on which device to run and also manages the resources on that device. All OpenCL work is performed within this context. GPUs are equipped with little or no branch prediction hardware, unlike most CPUs [14]. As a result, it is crucial to write GPU kernels with as little branching as possible to maximize performance.…”
Section: Execution Model
confidence: 99%
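The snippet's advice about branching can be illustrated with a small sketch (written in CUDA rather than OpenCL only to keep the examples in one language; the kernels are hypothetical). When a condition depends on data, lanes of a warp may take different paths and the warp executes both sides; computing both outcomes and selecting one lets the compiler predicate the code so the warp never splits.

```cuda
// Divergent version: lanes of a warp may disagree on the condition,
// so both sides of the branch are executed serially by the warp.
__global__ void relu_scale_branchy(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f) x[i] = 0.0f;
        else             x[i] = 2.0f * x[i];
    }
}

// Branch-free version: a select replaces the data-dependent branch,
// which typically compiles to predicated instructions with no warp split.
__global__ void relu_scale_branchless(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        x[i] = (v < 0.0f) ? 0.0f : 2.0f * v;
    }
}
```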
“…also propose a framework supporting a number of widely-used parallel patterns for efficient nested parallelism. [39] introduces warp-aware trace scheduling for GPUs based on speculating loads and arithmetic instructions upon divergence in order to exploit ILP. Recently, Schaub et al.…”
Section: Related Work On Divergence
confidence: 99%