2011 International Conference on High Performance Computing & Simulation
DOI: 10.1109/hpcsim.2011.5999886

Understanding the impact of CUDA tuning techniques for Fermi

Cited by 36 publications (15 citation statements)
References 4 publications
“…In the Fermi architecture, when L1 cache memory is active, the size of a global memory transaction is 128 bytes. If L1 cache memory is not active, it is 32 bytes. With noncoalesced memory access, part of the data brought from memory is unused, so the bandwidth is not being effectively used.…”
Section: Methods
mentioning confidence: 99%
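As a rough illustration of the coalescing behaviour this excerpt describes, the sketch below contrasts a unit-stride (coalesced) copy with a copy strided by 128 bytes, where a warp's loads scatter across many cache lines so that most of each fetched line is wasted. The kernel names, problem size, and stride are illustrative assumptions, not taken from the cited papers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: consecutive threads of a warp read consecutive 4-byte words,
// so the warp's 32 loads are served by a single 128-byte transaction
// when L1 caching of global loads is enabled.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Noncoalesced: a stride of 32 floats (128 bytes) sends every thread of a
// warp to a different cache line, so most of each fetched line goes unused.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((size_t)i * stride) % n];
}

int main()
{
    const int n = 1 << 24;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    copy_coalesced<<<grid, block>>>(in, out, n);      // near-peak bandwidth
    copy_strided  <<<grid, block>>>(in, out, n, 32);  // wastes most of each line
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

On Fermi, L1 caching of global loads can also be turned off at compile time with nvcc -Xptxas -dlcm=cg, in which case misses are served by 32-byte segments rather than 128-byte lines, which reduces the waste for scattered access patterns.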
“…If L1 cache memory is not active, it is 32 bytes [33]. With noncoalesced memory access, part of the data brought from memory is unused, so the bandwidth is not being effectively used. If this bandwidth waste is mitigated, some improvement in performance can be expected.…”
Section: Advanced Strategy 1: Allocation Of 1 Individual Per Thread O…
mentioning confidence: 99%
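One common way to mitigate that waste, in the spirit of the one-individual-per-thread allocation named in the section title above, is to store the population as a structure of arrays rather than an array of structures, so that a warp's accesses to a given field fall on consecutive addresses and coalesce. The sketch below is only an assumed layout for illustration; the struct names, field count, and kernels are not from the cited paper.

```cuda
#include <cuda_runtime.h>

// Array-of-structures: each individual occupies a contiguous 32-byte record,
// so when every thread reads the same field the warp's loads are strided
// and most of each fetched cache line is wasted.
struct IndividualAoS { float fitness; float genes[7]; };

__global__ void read_fitness_aos(const IndividualAoS *pop, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = pop[i].fitness;     // 32-byte stride per thread
}

// Structure-of-arrays: the same field of consecutive individuals is
// contiguous, so the warp's 32 loads coalesce into full transactions.
struct PopulationSoA { float *fitness; float *genes; /* genes: 7*n floats */ };

__global__ void read_fitness_soa(PopulationSoA pop, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = pop.fitness[i];     // unit stride, coalesced
}
```

Allocating pop.fitness and pop.genes as two separate cudaMalloc'd arrays (or one array with a fixed offset) is enough to obtain the coalesced layout; the per-thread logic itself does not change.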
“…The factors affecting occupancy are the thread block size, the shared memory used by each thread block, and the registers used by each thread. Tuning the thread block size can have a significant effect on performance [24]. Therefore, we aim to keep the occupancy of the new kernel as high as possible by tuning the thread block size.…”
Section: Tuning Thread Block Size
mentioning confidence: 99%
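The occupancy trade-off described here can be explored programmatically. The sketch below uses the CUDA occupancy API (added in CUDA 6.5, i.e. after the Fermi-era paper under discussion, which would have relied on the occupancy calculator spreadsheet); my_kernel is a stand-in, not a kernel from the cited work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Theoretical occupancy for a range of candidate block sizes: the limit
    // comes from registers per thread, shared memory per block, and the
    // per-SM thread/block caps, exactly the factors listed in the excerpt.
    for (int blockSize = 64; blockSize <= 1024; blockSize *= 2) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, my_kernel, blockSize, /*dynamicSmemBytes=*/0);
        float occupancy = (float)(blocksPerSM * blockSize) /
                          prop.maxThreadsPerMultiProcessor;
        printf("block %4d -> %d resident blocks/SM, occupancy %.2f\n",
               blockSize, blocksPerSM, occupancy);
    }

    // Or let the runtime suggest a block size that maximizes occupancy.
    int minGridSize = 0, bestBlockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &bestBlockSize, my_kernel);
    printf("occupancy-maximizing block size: %d\n", bestBlockSize);
    return 0;
}
```

As the quoted passage implies, maximizing theoretical occupancy is a useful target rather than a guarantee of best performance, so candidate block sizes are typically still timed empirically.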
“…In addition, these studies usually change the programs themselves, while our work attempts to analyze memory behaviors of given programs. One study [17] provides a limited observation of GPU cache impact on a handful of simple kernels. In contrast, our work provides a more systematic characterization of GPU cache effectiveness and uses that to develop an algorithm for automating the choice of how and when to use demand-fetched caches.…”
Section: Related Work
mentioning confidence: 99%