Proceedings of the 36th Annual International Symposium on Computer Architecture 2009
DOI: 10.1145/1555754.1555775

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Abstract: GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performan…

Cited by 403 publications (98 citation statements, published 2010–2017); references 18 publications.
“…Threads in one block cannot communicate with threads in another block, as they may be scheduled at different times. This architecture implies that any job to be run on a GPU has to be broken into blocks of computation that can run independently, without communicating with each other [32]. These blocks must be further broken down into smaller tasks that execute on individual threads, which may communicate with other threads in the same block.…”
Section: Vertical Scaling Platforms
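
A minimal CUDA sketch of this decomposition (the kernel and variable names are illustrative, not from the cited work): each block independently reduces its own slice of the input, threads within a block cooperate through shared memory and __syncthreads(), and no two blocks communicate during the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block reduces its slice of `in` independently; threads inside a
// block share data via shared memory, but blocks never exchange data.
// Assumes blockDim.x is a power of two and at most 256.
__global__ void blockSum(const float* in, float* blockOut, int n) {
    __shared__ float partial[256];                  // per-block scratch
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                                // intra-block barrier only
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockOut[blockIdx.x] = partial[0]; // one result per block
}

int main() {
    const int n = 1 << 16, threads = 256, blocks = n / threads;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, blocks * sizeof(float));
    // ... fill `in`, then compute one independent partial sum per block.
    blockSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```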
“…CUDA programs on the host (CPU) invoke a kernel which runs on the device (GPU). All threads within a block are executed concurrently on the architecture [14]. In addition, when a multiprocessor is given one or more thread blocks to execute, it partitions them into groups of 32 parallel threads termed warps.…”
Section: Preliminaries
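
A minimal sketch of this host/device split (kernel and sizes are our own illustration): the host launches the kernel, and the hardware partitions each thread block into warps of 32 threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= a;   // one element per thread
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    // Host (CPU) invokes the kernel on the device (GPU). Each
    // 128-thread block is partitioned by the hardware into
    // 128 / 32 = 4 warps of 32 threads that execute together.
    int threadsPerBlock = 128;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```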
“…Guo et al. [9] presented a performance modeling and optimization analysis to predict and optimize SpMV performance on GPUs. A simple analytical GPU model to predict the execution time of massively parallel programs was given by Hong et al. [14]. Schaa et al. [15] presented a model to accurately estimate the execution time of GPU applications under varying configurations.…”
Section: Introduction
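
In simplified, paraphrased form (not the paper's full equations), the model of Hong et al. [14] is built around two quantities per streaming multiprocessor: memory warp parallelism (MWP), the number of warps that can overlap memory accesses, and computation warp parallelism (CWP), how many warps' worth of computation fits under one memory access.

```latex
% Core quantities of the Hong-Kim analytical model [14], paraphrased;
% N = number of active warps per streaming multiprocessor.
\[
  \mathrm{MWP} \approx \min\!\left(
      \frac{\mathrm{Mem\_latency}}{\mathrm{Departure\_delay}},\; N\right),
  \qquad
  \mathrm{CWP} = \min\!\left(
      \frac{\mathrm{Mem\_cycles} + \mathrm{Comp\_cycles}}
           {\mathrm{Comp\_cycles}},\; N\right)
\]
% When MWP >= CWP, computation hides memory latency and the predicted
% execution time is roughly compute-bound; when CWP > MWP, memory
% requests serialize and the memory system dominates the estimate.
```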
“…[15] evaluated different placements of the memory controller in many-core CMPs. A full system was simulated to test communication performance in order to find an optimal placement.…”
Section: Related Work