Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture 2011
DOI: 10.1145/2155620.2155656
Improving GPU performance via large warps and two-level warp scheduling

Abstract: Due to their massive computational power, graphics processing units (GPUs) have become a popular platform for executing general purpose parallel applications. GPU programming models allow the programmer to create thousands of threads, each executing the same computing kernel. GPUs exploit this parallelism in two ways. First, threads are grouped into fixed-size SIMD batches known as warps, and second, many such warps are concurrently executed on a single GPU core. Despite these techniques, the computational resources…
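
The warp model in the abstract is easy to make concrete. The following C++ sketch is illustrative only, not code from the paper: it models threads grouped into fixed-size SIMD warps and shows how a divergent branch deactivates lanes, which is exactly the underutilization the paper targets. WARP_SIZE = 32 mirrors common hardware; everything else is an assumption for the example.

```cpp
// Sketch (not from the paper): a GPU core groups scalar threads into
// fixed-size SIMD warps; a divergent branch disables some lanes, so the
// SIMD unit still issues an instruction but at reduced utilization.
#include <bitset>
#include <cstdio>
#include <vector>

constexpr int WARP_SIZE = 32;

struct Warp {
    std::bitset<WARP_SIZE> active;  // per-lane active mask
};

int main() {
    // 128 threads -> 4 warps of 32 lanes each, all lanes initially active.
    std::vector<Warp> warps(128 / WARP_SIZE);
    for (auto& w : warps) w.active.set();

    // A divergent branch such as `if (tid % 2 == 0)` leaves only the
    // even lanes active on one path: 50% of the SIMD width is wasted.
    for (auto& w : warps)
        for (int lane = 1; lane < WARP_SIZE; lane += 2)
            w.active.reset(lane);

    for (size_t i = 0; i < warps.size(); ++i)
        std::printf("warp %zu: %zu/%d lanes active\n",
                    i, warps[i].active.count(), WARP_SIZE);
}
```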

Cited by 355 publications (249 citation statements)
References 23 publications
“…Several recent works focus on bandwidth compression to decrease memory traffic by transmitting data in a compressed form in both CPUs [17], [24], [3] and GPUs [21], [17], [26], which results in better system performance and energy consumption. Bandwidth compression proves to be particularly effective in GPUs because GPUs are often bottlenecked by memory bandwidth [15], [14], [13], [28], [26]. GPU applications also exhibit high degrees of data redundancy [21], [17], [26], leading to good compression ratios.…”
Section: Introduction
confidence: 99%
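
As context for the excerpt above, the sketch below illustrates one well-known flavor of bandwidth compression, a base-plus-delta scheme in the spirit of base-delta-immediate compression. It is a hedged illustration, not the design of any cited work; the function names and the line contents are invented for the example.

```cpp
// Hedged sketch of base+delta bandwidth compression: if every 4-byte
// word in a cache line is close to the first word, transmit one 4-byte
// base plus a 1-byte delta per word instead of the full line.
#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

// Returns the deltas, or nullopt when any delta does not fit in one
// byte (i.e. the line is incompressible under this scheme).
std::optional<std::vector<int8_t>> compress(const std::vector<uint32_t>& line) {
    std::vector<int8_t> deltas;
    uint32_t base = line.front();
    for (uint32_t w : line) {
        int64_t d = static_cast<int64_t>(w) - base;
        if (d < INT8_MIN || d > INT8_MAX) return std::nullopt;
        deltas.push_back(static_cast<int8_t>(d));
    }
    return deltas;
}

int main() {
    // Redundant data (nearby addresses/values) compresses well, which is
    // the property the excerpt attributes to GPU applications.
    std::vector<uint32_t> line = {1000, 1004, 1008, 1012};
    if (auto d = compress(line))
        std::printf("sent %zu bytes instead of %zu\n",
                    sizeof(uint32_t) + d->size(),
                    line.size() * sizeof(uint32_t));
}
```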
“…Compared to the resources shown in Table 1, the hardware overhead of our proposed approach, including the TB dispatcher logic and the per-SM 40-bit workload buffer, is nearly negligible. The baseline warp scheduling policy is round robin (RR) and the two-level warp scheduling policy [17] is examined in our design space exploration in Section 6.3. In our design space exploration, we also vary the register file size and the SIMD width to evaluate the effectiveness of our approach in different configurations.…”
Section: Experimental Methodology
confidence: 99%
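
For readers unfamiliar with the policy this excerpt evaluates, here is a minimal C++ sketch of two-level warp scheduling as commonly described: warps are partitioned into fetch groups, scheduling is round-robin within the active group, and the scheduler falls through to the next group once every warp in the current group stalls on a long-latency operation. Group sizes, struct names, and the driver code are illustrative assumptions, not the paper's implementation.

```cpp
// Sketch of two-level round-robin warp scheduling: round-robin inside
// the active fetch group, switching groups when the whole group stalls.
#include <cstdio>
#include <vector>

struct Warp { int id; bool stalled = false; };

struct TwoLevelScheduler {
    std::vector<std::vector<Warp>> groups;  // fetch groups of warps
    size_t group = 0, next = 0;

    // Pick the next ready warp, or nullptr if every warp is stalled.
    Warp* pick() {
        for (size_t g = 0; g < groups.size(); ++g) {
            auto& grp = groups[(group + g) % groups.size()];
            for (size_t i = 0; i < grp.size(); ++i) {
                Warp& w = grp[(next + i) % grp.size()];
                if (!w.stalled) {
                    group = (group + g) % groups.size();
                    next = (&w - grp.data()) + 1;
                    return &w;
                }
            }
            next = 0;  // group fully stalled: fall through to next group
        }
        return nullptr;
    }
};

int main() {
    TwoLevelScheduler s;
    s.groups = {{{0}, {1}}, {{2}, {3}}};  // two fetch groups of two warps
    s.groups[0][0].stalled = s.groups[0][1].stalled = true;  // group 0 waits on memory
    if (Warp* w = s.pick()) std::printf("issue warp %d\n", w->id);  // prints warp 2
}
```

Because different fetch groups reach their long-latency loads at staggered times, some group usually still has ready warps, which is how the policy hides memory latency better than plain round-robin over all warps.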
“…[6,7]. Recent work has focussed on two-level warp scheduling to reduce the impact of memory latency [4,8]. Although we do not address control flow, we note that an ideal scheduler takes both aspects (data-locality and control flow) into account.…”
Section: Related Work
confidence: 99%