A performance analysis framework for identifying potential benefits in GPGPU applications

Sim, Jaewoong; Dasgupta, Aniruddha; Kim, Hyesoon; Vuduc, Richard

doi:10.1145/2370036.2145819

Cited by 48 publications

(21 citation statements)

References 19 publications

(16 reference statements)


“…In addition to the memory requirements of each stage, information about the computational characteristics of each stage is required. The estimated runtime could be inferred from hardware performance models [178,145,28,9]. Approaches as those described in Section 4.1.2 can be applied to estimate the suitability of different agent update stages for execution on a certain accelerator.…”

Section: Computational Profiling (mentioning)

confidence: 99%

A Survey on Agent-based Simulation Using Hardware Accelerators

et al. 2019


Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as FPGAs. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorization of the literature according to the applied techniques. Since at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.


“…Since very little information about the underlying GPU architecture is disclosed, it becomes very unlikely to build accurate simulators for each new GPU generation. Luckily, the results [6,10,17,20] show that we can have very good approximation of GPU performance using analytical approaches. However existing GPU performance models all rely on certain level of an application's implementation (C++ code, PTX code, assembly code.…”

Section: Introduction (mentioning)

confidence: 95%

“…Bakhoda et al [5] developed a detailed GPU simulator and the simulator also uses the PTX code as input. Recently, Sim et al [17] extended the MWP-CWP model and utilize the assembly code of CUDA kernel to predict performance. The quantitative GPU performance model proposed by Zhang and Owens [20] is also based on the native assembly code.…”

Section: Introduction (mentioning)

confidence: 99%

“…The roofline model [19] is well known for estimating the optimization effects. The recent work by Sim et al [17] studied the effects of different optimization techniques on GPUs using the similar approach as the roofline model. However, the chosen optimizations normally rely on the initial code version and different optimizations are likely to have complex impacts on each other.…”

Section: Introduction (mentioning)

confidence: 99%


Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs

Lai, Seznec 2013

Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)


In this paper, we present an approach to estimate GPU applications' performance upper bound based on algorithm analysis and assembly code level benchmarking. As an example, we analyze the potential peak performance of SGEMM (Single-precision General Matrix Multiply) on Fermi (GF110) and Kepler (GK104) GPUs. We try to answer the question of how much optimization space is left for SGEMM and why. According to our analysis, the nature of Fermi (Kepler) instruction set and the limited issue throughput of the schedulers are the main limitation factors for SGEMM to approach the theoretical peak performance. The estimated upper-bound peak performance of SGEMM is around 82.5% of the theoretical peak performance on GTX580 Fermi GPU and 57.6% on GTX680 Kepler GPU. Guided by this analysis and using the native assembly language, on average, our SGEMM implementations achieve about 5% better performance than CUBLAS in CUDA 4.1 SDK for large matrices on GTX580. The achieved performance is around 90% of the estimated upper-bound performance of SGEMM on GTX580. On GTX680, the best performance we achieve is around 77.3% of the estimated performance upper bound. We also describe how to use native assembly language directly in the CUDA runtime source code.


“…Here, we explain a model based on the approach of Sim et al [85]. Much performance modeling has been done for both CPUs and GPUs.…”

Section: Instruction-level Analysis and Tuning (mentioning)

confidence: 99%

Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)

Kim, Vuduc, Baghsorkhi et al. 2012

Synthesis Lectures on Computer Architecture

Self Cite


General-purpose graphics processing units (GPGPU) have emerged as an important class of shared memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles that are used in previous shared memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to provide hints to architects about understanding algorithm aspect to GPGPU. We also provide detailed performance analysis and guide optimizations from high-level algorithms to low-level instruction level optimizations. As a case study, we use n-body particle simulations known as the fast multipole method (FMM) as an example. We also briefly survey the state-of-the-art in GPU performance analysis tools and techniques.
