Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques 2013
DOI: 10.1109/pact.2013.6618813

Reshaping cache misses to improve row-buffer locality in multicore systems

Abstract: General-purpose graphics processing units (GPGPUs) are at their best in accelerating computation by exploiting the abundant thread-level parallelism (TLP) offered by many classes of HPC applications. To facilitate such high TLP, emerging programming models like CUDA and OpenCL allow programmers to create work abstractions in terms of smaller work units, called cooperative thread arrays (CTAs). CTAs are groups of threads and can be executed in any order, thereby providing ample opportunities for TLP. The state-of…
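The CTA abstraction described in the abstract maps directly onto CUDA thread blocks. As a minimal, illustrative sketch (not taken from the paper), the kernel below is launched as a grid of CTAs that carry no ordering dependence, so the hardware is free to schedule them in any order across SMs:

```cuda
#include <cuda_runtime.h>

// Each thread block launched below is a CTA in the abstract's terminology.
__global__ void scale(float *data, float alpha, int n) {
    // Global index derived from the CTA id (blockIdx) and thread id (threadIdx).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;  // no inter-CTA communication required
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    // 256 threads per CTA; enough CTAs to cover all n elements. Because the
    // CTAs are independent, they can run in any order: the source of TLP.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```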

Cited by 16 publications (3 citation statements)
References 29 publications (38 reference statements)
“…Prior studies on the impact of TLP on throughput-oriented processors, including References [14,18,27], are discussed in Section 4.…”
Section: Related Work
confidence: 99%
“…The inclusion of multiple levels of caches complicates the relationship between TLP and overall performance. As studied in prior works [14,18,27], high degrees of TLP cause the cache to suffer from contention, which can degrade performance. For example, Figure 1 shows the impact of cache thrashing at various degrees of TLP.…”
Section: Introduction
confidence: 99%
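The degree of TLP the quoted statement refers to is, concretely, the number of CTAs resident per SM, which is what throttling schemes such as those in [14,18,27] limit in hardware. As a hedged host-side sketch using the standard CUDA occupancy API (illustrative only; the cited mechanisms are hardware schedulers, not this API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() { /* body irrelevant to the occupancy query */ }

int main() {
    int ctasPerSM = 0;
    // Real CUDA runtime call: reports how many CTAs of `kernel` can be
    // resident on one SM for a 128-thread block size and 0 bytes of
    // dynamic shared memory. More resident CTAs means more TLP, but also
    // more threads sharing the same L1/L2 capacity, hence more contention.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ctasPerSM, kernel, 128, 0);
    printf("CTAs resident per SM: %d\n", ctasPerSM);
    return 0;
}
```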
“…Chen et al. [18] proposed a novel warp scheduling algorithm that flexibly uses time-sliced round-robin scheduling to exploit GPU parallelism. Jog et al. [19] and Kayiran et al. [20] proposed CTA-aware warp scheduling algorithms to reduce cache and memory contention or to improve thread-level parallelism. Rogers et al. [4] analyzed how the hardware scheduler influences GPU cache management and proposed a cache-sensitive warp scheduling policy.…”
Section: Related Work
confidence: 99%
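To make the scheduling policies compared above concrete, here is a toy host-side C++ sketch of the baseline round-robin warp selection that the cited schedulers refine with time slices, CTA awareness, or cache sensitivity. It is an assumption-laden illustration, not any cited design:

```cuda
#include <cstdio>
#include <vector>

struct Warp { int id; bool ready; };  // ready = not stalled on memory

// Round-robin: issue the first ready warp after the one issued last,
// wrapping around. The cited policies change which warp is preferred.
int pickNextWarp(const std::vector<Warp> &warps, int last) {
    int n = (int)warps.size();
    for (int step = 1; step <= n; ++step) {
        int cand = (last + step) % n;
        if (warps[cand].ready) return cand;
    }
    return -1;  // no warp ready: the SM would stall this cycle
}

int main() {
    std::vector<Warp> warps = {{0, true}, {1, false}, {2, true}, {3, true}};
    int last = 3;
    for (int cycle = 0; cycle < 4; ++cycle) {
        last = pickNextWarp(warps, last);
        printf("cycle %d: issue warp %d\n", cycle, last);
    }
    return 0;
}
```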