A quantitative performance analysis model for GPU architectures

Zhang, Yao; Owens, John D.

doi:10.1109/hpca.2011.5749745

Cited by 202 publications

(105 citation statements)

References 16 publications

Supporting

Mentioning

104

Contrasting

“…Zhang et al [38] developed a microbenchmark-based performance model that allows programmers and architects to identify GPU program bottlenecks and predict the benefits of potential program optimizations and architectural improvements. Our work focuses on real GPU applications instead of microbenchmarks.…”

Section: G Discussion

mentioning

confidence: 99%

BenchFriend

Che

Skadron

2013

The International Journal of High Performance Computing Applica

Abstract: Graphics processing units (GPUs) have become an important platform for general-purpose computing, thanks to their high parallel throughput and high memory bandwidth. GPUs present significantly different architectures from CPUs and require specific mappings and optimizations to achieve high performance. This makes GPU workloads demonstrate application characteristics different from those of CPU workloads. It is critical for researchers to understand the first-order metrics that most influence GPU performance and scalability. Furthermore, methodologies and associated tools are needed to analyze and predict the performance of GPU applications and help guide users' purchasing decisions. In this work, we study an approach of predicting the performance of GPU applications by correlating them to existing workloads. One tenet of benchmark design, also a motivation of this paper, is that users should be given capabilities of leveraging standard workloads to infer the performance of applications of their interest. We first identify a set of important GPU application characteristics and then use them to predict performance of an arbitrary application by determining its most similar proxy benchmarks. We demonstrate the prediction methodology and conduct predictions with benchmarks from different suites to achieve better workload coverage. The experimental results show that we are able to achieve satisfactory performance predictions, although errors are higher for outlier applications. Finally, we discuss several considerations for systematically constructing future benchmark suites.

“…Zhang et al [23] have presented a quantitative performance analysis model, based on micro-benchmarks for NVIDIA GeForce 200-series GPUs. They have developed a throughput model for three components of GPU execution time: the instruction pipeline, shared memory access, and global memory access.…”

Section: It Is Common To Present the Parameters Of The Bsp Model As A

mentioning

confidence: 99%

A Simple BSP-based Model to Predict Execution Time in GPU Applications

Amarís

Cordeiro

Goldman

et al. 2015

2015 IEEE 22nd International Conference on High Performance Computing (HiPC)

Abstract: Models are useful to represent abstractions of software and hardware processes. The Bulk Synchronous Parallel (BSP) is a bridging model for parallel computation that allows algorithmic analysis of programs on parallel computers using performance modeling. The main idea of BSP model is the treatment of communication and computation as abstractions of a parallel system. Meanwhile, the use of GPU devices are becoming more widespread and they are currently capable of performing efficient parallel computation for applications that can be decomposed on thousands of simple threads. However, few models for predicting application execution time on GPUs have been proposed. In this work we present a simple and intuitive BSP-based model for predicting the CUDA application execution times on GPUs. The model is based on the number of computations and memory accesses of the GPU, with additional information on cache usage obtained from profiling. Scalability, divergence, effect of optimizations and differences of architectures are adjusted by a single parameter. We evaluated our model using two applications and six different boards. We showed by using profile information for a single board, that the model is general enough to predict the execution time of an application with different input sizes and on different boards with the same architecture. Our model predictions were within 0.8 to 1.2 times the measured execution times, which are reasonable for such a simple model. These results indicate that the model is good enough to generalize the predictions for different problem sizes and GPU configurations.

“…Recently, graphics cards or graphics processing units (GPU), introduced primarily for high-end gaming requiring high resolution, are now intensively being used, as a co-processor to the CPU, for general purpose computing [2,3]. The GPU itself is a multi-core processor having support for thousands of threads [4] running concurrently. GPUs are result of dozens of streaming processors with hundreds of core aligned in a particular way forming a single hardware unit.…”

Section: Introduction

mentioning

confidence: 99%

“…The high powerful Quadro-6000 and GTX-260 is well suited for desktops with power requirement of 204W and 182W respectively. 4 . One difference between CUDA and OpenCL is that CUDA is specific for GPU devices whereas OpenCL is heterogeneous and targets all devices conforming its specification [5], [6].…”

Section: Introduction

mentioning

confidence: 99%