2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2016.7446062

Warped-preexecution: A GPU pre-execution approach for improving latency hiding

Cited by 34 publications (9 citation statements); references 39 publications.
“…We compare HAWS scheduling against Greedy-then-oldest (GTO, our baseline) [26], CTA-aware scheduling [9], and warped-preexecution scheduling [12]. Figure 12 shows the overall performance achieved by HAWS and the competing techniques for both memory-intensive and non-memory-intensive applications, as compared to the baseline GPU model.…”
Section: Performance
confidence: 99%
“…Gong et al [4] proposed TwinKernels, which takes advantage of different instruction scheduling algorithms in the compiler to improve the overlap of compute and memory operations. Kim et al [12] proposed a warped-preexecution approach on GPUs. In this technique, wavefronts try to issue future instructions that are independent of the stalling instructions.…”
Section: Related Work Using Hints in Microprocessors
confidence: 99%
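
The look-ahead behavior described in this quote can be sketched in a few lines. The C++ below is a minimal, hypothetical illustration of the issue-selection step only, not the paper's actual mechanism: when a warp's next instruction stalls on a pending long-latency load, the warp enters a pre-execution mode and scans ahead for an independent instruction to issue rather than being de-scheduled. All types and names (Warp, Instr, Scoreboard) are assumptions for this sketch; a real design would also track a separate pre-execution PC and define how P-mode results are kept or re-executed, which is omitted here.

```cpp
#include <cstddef>
#include <vector>

struct Instr {
    std::vector<int> srcRegs;   // source register ids
    int dstReg;                 // destination register id (-1 if none)
    bool isLongLatencyLoad;     // e.g., a global memory load
};

struct Scoreboard {
    // pending[r] is true while register r awaits a long-latency result
    std::vector<bool> pending;
    explicit Scoreboard(int nregs) : pending(nregs, false) {}

    bool ready(const Instr& i) const {
        for (int r : i.srcRegs)
            if (pending[r]) return false;      // RAW hazard
        return i.dstReg < 0 || !pending[i.dstReg];  // WAW hazard
    }
};

struct Warp {
    std::vector<Instr> stream;  // decoded instruction stream
    size_t pc = 0;              // next instruction to issue in order
    bool preExecMode = false;   // set once the warp stalls on a load
};

// One scheduling step for one warp. Without pre-execution, a stalled
// warp is simply de-scheduled; with it, the warp skips ahead and issues
// a future instruction that is independent of the pending load.
bool tryIssue(Warp& w, Scoreboard& sb) {
    if (w.pc >= w.stream.size()) return false;
    const Instr& next = w.stream[w.pc];
    if (sb.ready(next)) {                       // normal in-order issue
        if (next.isLongLatencyLoad && next.dstReg >= 0)
            sb.pending[next.dstReg] = true;
        ++w.pc;
        return true;
    }
    w.preExecMode = true;                       // stalled: enter P-mode
    for (size_t i = w.pc + 1; i < w.stream.size(); ++i)
        if (sb.ready(w.stream[i]))
            return true;   // an independent future instruction can issue
    return false;          // nothing independent: de-schedule the warp
}
```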
“…Once the pending memory (or compute) operation completes, the data dependency is resolved and the warp is allowed to resume execution. Since floating-point operation latencies are fairly small, the majority of data hazards are caused by pending loads [17]. When all warps are de-scheduled due to data hazards, which is often the case in memory-intensive applications, the core is forced to stall.…”
Section: A. Implications of Congestion
confidence: 99%
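
The stall condition this quote describes reduces to a simple per-cycle scan. The sketch below uses assumed names and a simplified one-bit hazard state per warp: the scheduler looks for any warp whose next instruction has no pending-load hazard, and if every warp is blocked, the core issues nothing that cycle.

```cpp
#include <vector>

struct WarpState {
    bool nextSrcPendingLoad;  // next instruction reads a register
                              // still awaiting a pending load
};

// Returns the index of an issuable warp, or -1 if the core must stall
// this cycle because every warp is de-scheduled by a data hazard.
int pickWarp(const std::vector<WarpState>& warps) {
    for (int i = 0; i < static_cast<int>(warps.size()); ++i)
        if (!warps[i].nextSrcPendingLoad)
            return i;         // dependency resolved: warp can issue
    return -1;                // all warps blocked: core stalls
}
```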
“…Kim et al [17] proposed pre-execution of independent instructions in a warp to minimize the impact of data and structural dependencies. Similarly, Sethia et al [2] proposed a re-execution queue to reduce L1 cache hit latencies in the presence of structural hazards.…”
Section: Related Work
confidence: 99%
“…If a warp is stalled by a data dependency or a long-latency memory access, then warp schedulers issue another ready warp from the warp pool so that the execution of warps is interleaved [42]. The availability of stall hiding relies on the number of eligible warps in the warp pool, which is the primary reason why GPUs require a large number of concurrent threads [45]. Here, we use TLP to quantify the proportion of active warps in an SM.…”
Section: The Impact of Higher TLP
confidence: 99%
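
The TLP metric mentioned in this quote is a simple ratio of active warps to the SM's warp residency. A toy sketch follows; the 48-warp cap is an assumed example value (e.g., a Fermi-class SM), not a figure from the cited paper.

```cpp
#include <cstdio>

// TLP as the proportion of active warps among the maximum number of
// resident warps on an SM. The default cap of 48 is an assumption.
double tlp(int activeWarps, int maxResidentWarps = 48) {
    return static_cast<double>(activeWarps) / maxResidentWarps;
}

int main() {
    // e.g., 12 of 48 resident warps active -> TLP = 0.25
    std::printf("TLP = %.2f\n", tlp(12));
    return 0;
}
```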