Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture 2011
DOI: 10.1145/2155620.2155656
Improving GPU performance via large warps and two-level warp scheduling

Abstract: Due to their massive computational power, graphics processing units (GPUs) have become a popular platform for executing general purpose parallel applications. GPU programming models allow the programmer to create thousands of threads, each executing the same computing kernel. GPUs exploit this parallelism in two ways. First, threads are grouped into fixed-size SIMD batches known as warps, and second, many such warps are concurrently executed on a single GPU core. Despite these techniques, the computational resources…
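
The warp model in the abstract is easy to make concrete. The following C++ sketch is illustrative only, not code from the paper: it models threads grouped into fixed-size SIMD warps and shows how a divergent branch deactivates lanes, which is exactly the underutilization the paper targets. WARP_SIZE = 32 mirrors common hardware; everything else is an assumption for the example.

```cpp
// Sketch (not from the paper): a GPU core groups scalar threads into
// fixed-size SIMD warps; a divergent branch disables some lanes, so the
// SIMD unit still issues an instruction but at reduced utilization.
#include <bitset>
#include <cstdio>
#include <vector>

constexpr int WARP_SIZE = 32;

struct Warp {
    std::bitset<WARP_SIZE> active;  // per-lane active mask
};

int main() {
    // 128 threads -> 4 warps of 32 lanes each, all lanes initially active.
    std::vector<Warp> warps(128 / WARP_SIZE);
    for (auto& w : warps) w.active.set();

    // A divergent branch such as `if (tid % 2 == 0)` leaves only the
    // even lanes active on one path: 50% of the SIMD width is wasted.
    for (auto& w : warps)
        for (int lane = 1; lane < WARP_SIZE; lane += 2)
            w.active.reset(lane);

    for (size_t i = 0; i < warps.size(); ++i)
        std::printf("warp %zu: %zu/%d lanes active\n",
                    i, warps[i].active.count(), WARP_SIZE);
}
```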

Cited by 355 publications (249 citation statements)
References 23 publications
“…Several recent works focus on bandwidth compression to decrease memory traffic by transmitting data in a compressed form in both CPUs [17], [24], [3] and GPUs [21], [17], [26], which results in better system performance and energy consumption. Bandwidth compression proves to be particularly effective in GPUs because GPUs are often bottlenecked by memory bandwidth [15], [14], [13], [28], [26]. GPU applications also exhibit high degrees of data redundancy [21], [17], [26], leading to good compression ratios.…”
Section: Introduction
confidence: 99%
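
As context for the excerpt above, the sketch below illustrates one well-known flavor of bandwidth compression, a base-plus-delta scheme in the spirit of base-delta-immediate compression. It is a hedged illustration, not the design of any cited work; the function names and the line contents are invented for the example.

```cpp
// Hedged sketch of base+delta bandwidth compression: if every 4-byte
// word in a cache line is close to the first word, transmit one 4-byte
// base plus a 1-byte delta per word instead of the full line.
#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

// Returns the deltas, or nullopt when any delta does not fit in one
// byte (i.e. the line is incompressible under this scheme).
std::optional<std::vector<int8_t>> compress(const std::vector<uint32_t>& line) {
    std::vector<int8_t> deltas;
    uint32_t base = line.front();
    for (uint32_t w : line) {
        int64_t d = static_cast<int64_t>(w) - base;
        if (d < INT8_MIN || d > INT8_MAX) return std::nullopt;
        deltas.push_back(static_cast<int8_t>(d));
    }
    return deltas;
}

int main() {
    // Redundant data (nearby addresses/values) compresses well, which is
    // the property the excerpt attributes to GPU applications.
    std::vector<uint32_t> line = {1000, 1004, 1008, 1012};
    if (auto d = compress(line))
        std::printf("sent %zu bytes instead of %zu\n",
                    sizeof(uint32_t) + d->size(),
                    line.size() * sizeof(uint32_t));
}
```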
“…Compared to the resources shown in Table 1, the hardware overhead of our proposed approach, including the TB dispatcher logic and the per-SM 40-bit workload buffer, is nearly negligible. The baseline warp scheduling policy is round robin (RR) and the two-level warp scheduling policy [17] is examined in our design space exploration in Section 6.3. In our design space exploration, we also vary the register file size and the SIMD width to evaluate the effectiveness of our approach in different configurations.…”
Section: Experimental Methodology
confidence: 99%
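
For readers unfamiliar with the policy this excerpt evaluates, here is a minimal C++ sketch of two-level warp scheduling as commonly described: warps are partitioned into fetch groups, scheduling is round-robin within the active group, and the scheduler falls through to the next group once every warp in the current group stalls on a long-latency operation. Group sizes, struct names, and the driver code are illustrative assumptions, not the paper's implementation.

```cpp
// Sketch of two-level round-robin warp scheduling: round-robin inside
// the active fetch group, switching groups when the whole group stalls.
#include <cstdio>
#include <vector>

struct Warp { int id; bool stalled = false; };

struct TwoLevelScheduler {
    std::vector<std::vector<Warp>> groups;  // fetch groups of warps
    size_t group = 0, next = 0;

    // Pick the next ready warp, or nullptr if every warp is stalled.
    Warp* pick() {
        for (size_t g = 0; g < groups.size(); ++g) {
            auto& grp = groups[(group + g) % groups.size()];
            for (size_t i = 0; i < grp.size(); ++i) {
                Warp& w = grp[(next + i) % grp.size()];
                if (!w.stalled) {
                    group = (group + g) % groups.size();
                    next = (&w - grp.data()) + 1;
                    return &w;
                }
            }
            next = 0;  // group fully stalled: fall through to next group
        }
        return nullptr;
    }
};

int main() {
    TwoLevelScheduler s;
    s.groups = {{{0}, {1}}, {{2}, {3}}};  // two fetch groups of two warps
    s.groups[0][0].stalled = s.groups[0][1].stalled = true;  // group 0 waits on memory
    if (Warp* w = s.pick()) std::printf("issue warp %d\n", w->id);  // prints warp 2
}
```

Because different fetch groups reach their long-latency loads at staggered times, some group usually still has ready warps, which is how the policy hides memory latency better than plain round-robin over all warps.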
“…[6,7]. Recent work has focussed on two-level warp scheduling to reduce the impact of memory latency [4,8]. Although we do not address control flow, we note that an ideal scheduler takes both aspects (data-locality and control flow) into account.…”
Section: Related Work
confidence: 99%