Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2010
DOI: 10.1145/1693453.1693470
An adaptive performance modeling tool for GPU architectures

Cited by 197 publications (105 citation statements)
References 15 publications
“…The value, ITILP, models the possibility of inter-thread instruction-level parallelism in GPGPUs. The concept of ITILP was introduced in Baghsorkhi et al [11]. In particular, instructions may issue from multiple warps on a GPGPU; thus, we consider global ILP (i.e., ILP among warps) rather than warp-local ILP (i.e., the ILP of one warp).…”
Section: Execution Time Modeling
confidence: 99%
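As a hedged illustration of the distinction the statement draws, the CUDA sketch below (our own; the kernel name, grid-stride layout, and parameters are hypothetical and not taken from [11] or the citing paper) gives each thread four independent multiply operations, i.e. warp-local ILP. The hardware scheduler may additionally interleave ready instructions from other resident warps, which is the global, inter-thread ILP (ITILP) being modeled.

```cuda
#include <cuda_runtime.h>

// Illustrative only: each thread issues four independent multiplies
// (warp-local ILP). On real hardware, the warp scheduler can also
// interleave instructions from other resident warps (global ILP/ITILP).
__global__ void ilp_demo(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (i + 3 * stride < n) {
        // Four independent operations: no result feeds another, so the
        // scheduler can overlap their latencies within one warp.
        float r0 = a[i]              * b[i];
        float r1 = a[i + stride]     * b[i + stride];
        float r2 = a[i + 2 * stride] * b[i + 2 * stride];
        float r3 = a[i + 3 * stride] * b[i + 3 * stride];
        out[i]              = r0;
        out[i + stride]     = r1;
        out[i + 2 * stride] = r2;
        out[i + 3 * stride] = r3;
    }
}
```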
“…To provide a formal framework to study this problem, Baghsorkhi et al introduced the concept of balanced GPGPU computation [11]. This model represents a GPGPU computation using the computation carried by an average warp.…”
Section: Other Performance Modeling Techniques and Tools
confidence: 99%
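As a loose, first-order illustration only (the notation below is ours and is not the formulation of Baghsorkhi et al. [11], whose model rests on a more detailed program analysis), an "average warp" estimate might take a form like:

```latex
% Illustrative sketch only; NOT the model of Baghsorkhi et al. [11].
% W              : total warps launched by the kernel
% N_{SM}         : number of streaming multiprocessors
% C              : warps one SM can execute concurrently
% \bar{t}_{warp} : average execution time of a single warp
\[
  T_{\text{kernel}} \;\approx\;
  \left\lceil \frac{W}{N_{\text{SM}} \cdot C} \right\rceil
  \cdot \bar{t}_{\text{warp}}
\]
```

The point of the balanced-computation abstraction is that a single representative warp, rather than a per-thread simulation, carries the timing information for the whole kernel.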
“…GPU performance modeling has been tackled in some valuable research works [1,5,20], but none of them deals with data transfers between CPU and GPU and the use of streams. To the best of our knowledge, there is only one research work focused on CUDA streams performance [8].…”
Section: Introduction
confidence: 99%
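To illustrate the mechanism the statement says most models ignore, here is a minimal CUDA streams sketch (our own, not drawn from [8] or the other cited works): the input is split into chunks, and each chunk's host-to-device copy, kernel, and device-to-host copy are queued on a separate stream so transfers can overlap computation on other chunks.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* d, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main()
{
    const int n = 1 << 22, chunks = 4, chunk = n / chunks;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned memory: required for async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&s[c]);

    // Copy-in, kernel, and copy-out of each chunk go on that chunk's own
    // stream, so one chunk's transfers can overlap another's compute.
    for (int c = 0; c < chunks; ++c) {
        int off = c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        scale<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```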
“…Acceleration of applications via GPU shared memory has been developed on single GPUs [6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22], multi-GPU systems [23], and GPU clusters [24][25][26]. Since the capacity of GPU shared memory is limited, access by multiple threads often leads to bank conflicts, one of the key factors that degrade the performance of CUDA kernels.…”
confidence: 99%
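For readers unfamiliar with the bank-conflict problem the statement describes, the following sketch (ours, not from the cited works) shows the classic case: column-wise reads of a 32x32 shared-memory tile all land in a single bank, and padding each row by one element restores conflict-free access.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Reading the tile column-wise (tile[x][y] with x varying across the warp)
// strides shared memory by 32 floats, so all 32 threads of a warp hit the
// same bank: a 32-way conflict on 32-bank hardware.
__global__ void transpose_conflict(const float* in, float* out)
{
    __shared__ float tile[TILE][TILE];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];
    __syncthreads();
    out[y * TILE + x] = tile[x][y];  // conflicting column read
}

// Padding each row by one element shifts successive rows into different
// banks, removing the conflict without changing the algorithm.
__global__ void transpose_padded(const float* in, float* out)
{
    __shared__ float tile[TILE][TILE + 1];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];
    __syncthreads();
    out[y * TILE + x] = tile[x][y];  // conflict-free after padding
}
```

A launch such as transpose_padded<<<1, dim3(TILE, TILE)>>>(in, out) covers one 32x32 tile; a full transpose tiles a larger matrix the same way.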