2009 International Conference on High Performance Computing (HiPC) 2009
DOI: 10.1109/hipc.2009.5433179
A performance prediction model for the CUDA GPGPU platform

Cited by 79 publications (48 citation statements); references 17 publications.
“…Zhang and Owens [14] adopted a microbenchmark-based approach to develop a throughput model for three major components of GPU execution time: the instruction pipeline, shared memory access, and global memory access. Their model focuses on identifying performance bottlenecks and guiding programmers toward optimization; our model focuses on predicting the execution time, which is similar to [15]–[17]. Baghsorkhi et al. [15] presented a compiler-based GPU performance modeling approach with accurate prediction using program analysis and symbolic evaluation techniques.…”
Section: Related Work
confidence: 99%
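The bottleneck idea in the excerpt above can be sketched as a toy calculation: estimate a time for each of the three components (work divided by throughput) and take the slowest one as the kernel's bottleneck. This is an illustrative approximation, not Zhang and Owens' actual model; the function name and all numbers are hypothetical.

```python
# Illustrative bottleneck-style throughput estimate (hypothetical
# parameters; NOT the actual Zhang-Owens model).

def kernel_time_bottleneck(inst_cycles, smem_accesses, gmem_accesses,
                           inst_throughput, smem_throughput, gmem_throughput):
    """Estimate kernel time as the slowest of three components.

    Each component time = work / throughput; the component with the
    largest time is the performance bottleneck.
    """
    components = {
        "pipeline": inst_cycles / inst_throughput,
        "shared": smem_accesses / smem_throughput,
        "global": gmem_accesses / gmem_throughput,
    }
    bottleneck = max(components, key=components.get)
    return components[bottleneck], bottleneck

# Hypothetical kernel: here global memory dominates.
t, which = kernel_time_bottleneck(1e9, 2e8, 5e7, 5e11, 1e11, 1e10)
print(which, t)
```

A model of this shape is useful for optimization guidance precisely because it names the limiting component, whereas a pure execution-time predictor need not.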
“…Their model estimates the number of parallel memory requests by taking into account the number of running threads and the memory bandwidth. Kothapalli et al. [17] presented a performance model that combines several known models of parallel computation: BSP, PRAM, and QRQW. However, their proposed analytical models are based on an abstraction of the GPU architecture.…”
Section: Related Work
confidence: 99%
“…Kothapalli et al. [20] have presented a combination of known models with small extensions. The models they have used are the BSP model, the PRAM model by Fortune and Wyllie [21], and the QRQW model by Gibbons [2].…”
Section: It Is Common To Present the Parameters of the BSP Model As A…
confidence: 99%
“…However, algorithms developed on these models do not always show good performance on GPUs because the PRAM models are substantially different from actual GPU architectures. For estimating the performance of GPU-based algorithms, several models have been proposed [5,6,7,8]. Hong et al. [9] and Kothapalli et al. [5] have proposed analytical models to estimate the actual running time of GPU-based algorithms without executing applications on GPUs.…”
Section: Introduction
confidence: 99%
“…For estimating the performance of GPU-based algorithms, several models have been proposed [5,6,7,8]. Hong et al. [9] and Kothapalli et al. [5] have proposed analytical models to estimate the actual running time of GPU-based algorithms without executing applications on GPUs. Ma et al. [7] and Nakano [8] have proposed memory access models that take memory access latency into account.…”
Section: Introduction
confidence: 99%
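To illustrate why memory access latency matters in such models: on a GPU, latency can be overlapped only when enough concurrent warps are available to keep requests in flight. The sketch below is a generic, hypothetical latency-versus-bandwidth estimate, not the specific model of Ma et al. or Nakano; all parameter values are invented for illustration.

```python
# Hedged sketch of a latency-aware memory cost estimate (hypothetical
# parameters; not any specific published model). With enough concurrent
# warps, memory latency is overlapped; otherwise it is exposed.

def memory_time(requests, latency, bandwidth_time, active_warps):
    """Estimate total memory time in cycles for `requests` accesses.

    latency        -- cycles for one access to complete
    bandwidth_time -- minimum cycles between successive requests
    active_warps   -- concurrent warps available to hide latency
    """
    # Warps needed so that issue rate, not latency, is the limit.
    warps_to_hide = latency / bandwidth_time
    if active_warps >= warps_to_hide:
        return requests * bandwidth_time          # bandwidth-bound
    # Otherwise part of each access latency is exposed.
    return requests * latency / active_warps      # latency-bound

print(memory_time(1000, latency=400, bandwidth_time=4, active_warps=100))
print(memory_time(1000, latency=400, bandwidth_time=4, active_warps=10))
```

The two-regime structure (bandwidth-bound versus latency-bound) is the qualitative behavior that PRAM-style models miss and that latency-aware GPU models are designed to capture.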