2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2016.7446062

Warped-preexecution: A GPU pre-execution approach for improving latency hiding

Cited by 34 publications (9 citation statements); references 39 publications.
“…We compare HAWS scheduling against Greedy-then-oldest (GTO, our baseline) [26], CTA-aware scheduling [9], and warped-preexecution scheduling [12]. Figure 12 shows the overall performance achieved by HAWS and the competing techniques for both memory-intensive and non-memory-intensive applications, as compared to the baseline GPU model.…”
Section: Performance
confidence: 99%
“…Gong et al [4] proposed TwinKernels, which takes advantage of different instruction scheduling algorithms in the compiler to improve the overlap of compute and memory operations. Kim et al [12] proposed a warped-preexecution approach on GPUs. In this technique, wavefronts try to issue future instructions that are independent of the stalling instructions.…”
Section: Related Work Using Hints in Microprocessors
confidence: 99%
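
The look-ahead behavior described in this quote can be sketched in a few lines. The C++ below is a minimal, hypothetical illustration of the issue-selection step only, not the paper's actual mechanism: when a warp's next instruction stalls on a pending long-latency load, the warp enters a pre-execution mode and scans ahead for an independent instruction to issue rather than being de-scheduled. All types and names (Warp, Instr, Scoreboard) are assumptions for this sketch; a real design would also track a separate pre-execution PC and define how P-mode results are kept or re-executed, which is omitted here.

```cpp
#include <cstddef>
#include <vector>

struct Instr {
    std::vector<int> srcRegs;   // source register ids
    int dstReg;                 // destination register id (-1 if none)
    bool isLongLatencyLoad;     // e.g., a global memory load
};

struct Scoreboard {
    // pending[r] is true while register r awaits a long-latency result
    std::vector<bool> pending;
    explicit Scoreboard(int nregs) : pending(nregs, false) {}

    bool ready(const Instr& i) const {
        for (int r : i.srcRegs)
            if (pending[r]) return false;      // RAW hazard
        return i.dstReg < 0 || !pending[i.dstReg];  // WAW hazard
    }
};

struct Warp {
    std::vector<Instr> stream;  // decoded instruction stream
    size_t pc = 0;              // next instruction to issue in order
    bool preExecMode = false;   // set once the warp stalls on a load
};

// One scheduling step for one warp. Without pre-execution, a stalled
// warp is simply de-scheduled; with it, the warp skips ahead and issues
// a future instruction that is independent of the pending load.
bool tryIssue(Warp& w, Scoreboard& sb) {
    if (w.pc >= w.stream.size()) return false;
    const Instr& next = w.stream[w.pc];
    if (sb.ready(next)) {                       // normal in-order issue
        if (next.isLongLatencyLoad && next.dstReg >= 0)
            sb.pending[next.dstReg] = true;
        ++w.pc;
        return true;
    }
    w.preExecMode = true;                       // stalled: enter P-mode
    for (size_t i = w.pc + 1; i < w.stream.size(); ++i)
        if (sb.ready(w.stream[i]))
            return true;   // an independent future instruction can issue
    return false;          // nothing independent: de-schedule the warp
}
```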
“…Once the pending memory (or compute) operation completes, the data dependency is resolved and the warp is allowed to resume execution. Since floating-point operation latencies are fairly small, the majority of data hazards are caused by pending loads [17]. When all warps are de-scheduled due to data hazards, which is often the case in memory-intensive applications, the core is forced to stall.…”
Section: A. Implications of Congestion
confidence: 99%
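
The stall condition this quote describes reduces to a simple per-cycle scan. The sketch below uses assumed names and a simplified one-bit hazard state per warp: the scheduler looks for any warp whose next instruction has no pending-load hazard, and if every warp is blocked, the core issues nothing that cycle.

```cpp
#include <vector>

struct WarpState {
    bool nextSrcPendingLoad;  // next instruction reads a register
                              // still awaiting a pending load
};

// Returns the index of an issuable warp, or -1 if the core must stall
// this cycle because every warp is de-scheduled by a data hazard.
int pickWarp(const std::vector<WarpState>& warps) {
    for (int i = 0; i < static_cast<int>(warps.size()); ++i)
        if (!warps[i].nextSrcPendingLoad)
            return i;         // dependency resolved: warp can issue
    return -1;                // all warps blocked: core stalls
}
```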
“…Kim et al [17] proposed pre-execution of independent instructions in a warp to minimize the impact of data and structural dependencies. Similarly, Sethia et al [2] proposed a re-execution queue to reduce L1 cache hit latencies in the presence of structural hazards.…”
Section: Related Work
confidence: 99%
“…If a warp is stalled by a data dependency or a long-latency memory access, then warp schedulers issue another ready warp from the warp pool so that the execution of warps is interleaved [42]. The availability of stall hiding relies on the number of eligible warps in the warp pool, which is the primary reason why GPUs require a large number of concurrent threads [45]. Here, we use TLP to quantify the proportion of active warps in an SM.…”
Section: The Impact of Higher TLP
confidence: 99%
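
The TLP metric mentioned in this quote is a simple ratio of active warps to the SM's warp residency. A toy sketch follows; the 48-warp cap is an assumed example value (e.g., a Fermi-class SM), not a figure from the cited paper.

```cpp
#include <cstdio>

// TLP as the proportion of active warps among the maximum number of
// resident warps on an SM. The default cap of 48 is an assumption.
double tlp(int activeWarps, int maxResidentWarps = 48) {
    return static_cast<double>(activeWarps) / maxResidentWarps;
}

int main() {
    // e.g., 12 of 48 resident warps active -> TLP = 0.25
    std::printf("TLP = %.2f\n", tlp(12));
    return 0;
}
```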