2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
DOI: 10.1109/isca.2016.59
Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit

Cited by 39 publications (27 citation statements)
References 50 publications
“…F_smem of Table II in Section V-A). This agrees with prior work's analysis [21], [22]. Exploiting such unused shared memory space, we propose to redirect memory requests of severely interfering warps to the unused shared memory space.…”
Section: B. CIAO On-chip Memory Architecture (supporting)
confidence: 88%
“…In GPUs, the number of blocks an SM can serve at a time is limited due to capacity and scheduling limits. The authors in [11] suggest that the number of blocks is limited mostly by scheduling limits rather than by resource constraints. So, they…”
Section: Thread-level Parallelism (mentioning)
confidence: 99%
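The distinction the excerpt draws between scheduling limits and resource constraints can be sketched as a small occupancy calculation. This is a minimal sketch under assumed, illustrative SM parameters (loosely Fermi-class); the function name and the numbers are not taken from the cited papers.

```python
# Sketch: why resident-block count can be bound by the scheduler, not resources.
# All parameters are illustrative assumptions (loosely Fermi-class).

def max_blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                      sched_limit=8,          # hardware scheduling limit (blocks/SM)
                      max_threads=1536,       # thread capacity per SM
                      reg_file=32768,         # 32-bit registers per SM
                      smem_bytes=48 * 1024):  # shared memory per SM (bytes)
    """Resident blocks per SM = min over the scheduling and resource limits."""
    limits = {
        "scheduling": sched_limit,
        "threads": max_threads // threads_per_block,
        "registers": reg_file // (threads_per_block * regs_per_thread),
        "shared_mem": smem_bytes // smem_per_block if smem_per_block else sched_limit,
    }
    bound = min(limits, key=limits.get)
    return limits[bound], bound

# A modest block (128 threads, 20 regs/thread, 4 KiB smem): every resource
# limit would allow 12 blocks, yet the scheduling limit caps residency at 8.
blocks, bound = max_blocks_per_sm(128, 20, 4 * 1024)
print(blocks, bound)  # → 8 scheduling
```

With these numbers the SM has resources to spare, which matches the excerpt's claim that the block count is often scheduling-limited rather than resource-limited.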
“…Both of these studies adjust the number of active warps to improve performance. The work in [42] analyzes the usage of computing and memory resources for different applications, simulates the maximum TLP, and exploits underutilized resources as much as possible. The work in [43] focuses on ILP, proposing to build SCs for GPU-like many-core processors to achieve both high performance and high energy efficiency.…”
Section: Two-level Parallelism Optimization Model (mentioning)
confidence: 99%
“…If a warp is stalled by a data dependency or a long-latency memory access, the warp schedulers issue another ready warp from the warp pool so that the execution of warps is interleaved [42]. The effectiveness of stall hiding relies on the number of eligible warps in the warp pool, which is the primary reason why GPUs require a large number of concurrent threads [45].…”
Section: The Impact of Higher TLP (mentioning)
confidence: 99%
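The stall-hiding behavior this excerpt describes can be sketched with a toy scheduler model. This is a minimal sketch; the `simulate` function, the fixed memory latency, and the instruction mix are illustrative assumptions, not details from [42] or [45].

```python
# Toy model of stall hiding via warp interleaving: a round-robin scheduler
# issues one instruction per cycle from any warp not waiting on a
# fixed-latency memory access. All parameters are illustrative.

def simulate(num_warps, instrs_per_warp=8, mem_latency=20, mem_every=4):
    remaining = [instrs_per_warp] * num_warps  # instructions left per warp
    ready_at = [0] * num_warps                 # cycle at which each warp is eligible
    issued = [0] * num_warps
    cycle, rr = 0, 0
    while any(remaining):
        for i in range(num_warps):             # round-robin over eligible warps
            w = (rr + i) % num_warps
            if remaining[w] and ready_at[w] <= cycle:
                remaining[w] -= 1
                issued[w] += 1
                if issued[w] % mem_every == 0:  # every 4th instr is a memory op
                    ready_at[w] = cycle + mem_latency  # warp stalls
                rr = (w + 1) % num_warps
                break
        cycle += 1                             # idle cycle if no warp was eligible
    return cycle

print(simulate(1))   # → 27 (one warp: memory stalls are fully exposed)
print(simulate(4))   # → 48 (four warps: far less than 4 x 27 cycles)
```

With a single warp every memory stall idles the pipeline, while four interleaved warps finish 4x the work in well under 4x the cycles, illustrating why GPUs want many eligible warps in the pool.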