2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors
DOI: 10.1109/asap.2014.6868641
Performance modeling for highly-threaded many-core GPUs

Abstract: Highly-threaded many-core GPUs can provide high throughput for a wide range of algorithms and applications. Such machines hide memory latencies via the use of a large number of threads and large memory bandwidth. The achieved performance, therefore, depends on the parallelism exploited by the algorithm, the effectiveness of latency hiding, and the utilization of multiprocessors (occupancy). In this paper, we extend previously proposed analytical models, jointly addressing parallelism, latency-hiding, and occupancy…
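The abstract names occupancy (multiprocessor utilization) as one of three factors the model addresses. A minimal sketch of how occupancy is typically computed from per-block resource usage, using illustrative hardware limits that are assumptions for this example and not taken from the paper:

```python
# Sketch of GPU occupancy estimation. The resource limits below are
# hypothetical round numbers (2048 resident threads, 65536 registers,
# 48 KiB shared memory, 16 blocks per multiprocessor), not figures
# from the paper or any specific device.

def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=2048, max_regs=65536, max_smem=48 * 1024,
              max_blocks=16):
    """Fraction of a multiprocessor's thread slots actually occupied."""
    # Each resource independently caps how many blocks can be resident.
    by_threads = max_threads // threads_per_block
    by_regs = max_regs // (regs_per_thread * threads_per_block)
    by_smem = max_smem // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_threads, by_regs, by_smem, max_blocks)
    return blocks * threads_per_block / max_threads

print(occupancy(256, 32, 4096))  # → 1.0 (no resource is limiting)
print(occupancy(256, 64, 4096))  # → 0.5 (register-limited)
```

Low occupancy leaves fewer threads available to hide memory latency, which is why the model treats occupancy and latency hiding jointly.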

Cited by 18 publications (7 citation statements)
References 26 publications
“…The large memory bandwidth can also be used to hide memory latency. The achieved performance, therefore, depends on the parallelism exploited by the algorithm, the effectiveness of latency hiding, and the utilization of multiprocessors (occupancy) [63]. Based on the results in Table 6, observe that our CUDA implementation exhibits better performance and a shorter runtime on the Titan than on the GTX 960.…”
Section: Evaluation 2: Performance on Different GPUs
confidence: 92%
“…A new model by Ma, Chamberlain & Agrawal (2014b) has recently been suggested for analyzing the complexities of parallel algorithms on graphics processors. This model is obtained from the combination of asymptotic and calibrated models.…”
Section: Complexity Analysis
confidence: 99%
“…These four metrics help to identify performance bottlenecks. Ma et al [29] design an analysis framework for many-core architectures, bridging the gap between asymptotic models and calibrated models that quantitatively predict runtime. The framework jointly addresses parallelism, latency-hiding, and occupancy; helps to reduce the configuration space for tuning kernels; and reflects performance trends as the problem size and other parameters scale.…”
Section: Algorithms
confidence: 99%
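The citation statements describe the framework as bridging asymptotic models and calibrated models that quantitatively predict runtime. One common shape for such a calibrated, latency-hiding model is to take the maximum of the compute and memory pipeline times; the sketch below illustrates that idea with made-up peak rates, and is an assumption for illustration rather than the paper's actual model:

```python
# Hypothetical calibrated runtime model: with enough threads in flight,
# compute and memory traffic overlap, so the slower pipeline dominates.
# The peak rates (1 Tflop/s compute, 300 GB/s bandwidth) are invented
# calibration constants, not measurements from the paper.

def predict_runtime(n_ops, n_bytes, compute_rate=1e12, bandwidth=3e11):
    """Predicted seconds for a kernel doing n_ops operations
    and moving n_bytes to/from memory."""
    return max(n_ops / compute_rate, n_bytes / bandwidth)

print(predict_runtime(1e12, 3e11))  # → 1.0 (balanced kernel)
print(predict_runtime(1e12, 6e11))  # → 2.0 (memory-bound kernel)
```

Because `n_ops` and `n_bytes` scale asymptotically with problem size while the rates are calibrated per machine, a model of this shape can reflect performance trends as the problem size and other parameters scale, as the citing paper notes.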