A Performance Model for Memory Bandwidth Constrained Applications on Graphics Engines

Ma, Lin; Chamberlain, Roger D.

doi:10.1109/asap.2012.19

Cited by 21 publications

(14 citation statements)

References 16 publications


“…The model presented in [19], which we will extend in the subsection that follows, characterizes algorithm performance in terms of the following factors: algorithmic complexity, f_app, caching, f_cache, and scheduling, f_sched. The algorithmic complexity factor is expressed via a function…”

Section: B Calibrated Modeling Of Runtime (mentioning)

confidence: 99%

“…If B_r > B_a × P/Q, multiple passes are needed to consume all the requested blocks of work. From [19], the number of active blocks B_a is described in equation (4) in terms of the shared memory required by the application S_B, the quantity of shared memory on each multiprocessor Z, the number of threads requested per thread block T_r, the processor registers required by the application R_T × T_r, the quantity of registers available per multiprocessor R, the maximally allowed thread blocks B_max, and the maximally allowed threads T_maxMP.…”

Section: B Calibrated Modeling Of Runtime (mentioning)

confidence: 99%
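
The resource limits named in the citation statement above can be sketched as a minimum over per-multiprocessor constraints. This is a minimal illustration, not equation (4) itself, which the excerpt does not reproduce: the min-of-floors form is the usual occupancy calculation and is assumed here. The function names, parameter names, and the reading of P/Q as the number of multiprocessors are likewise assumptions made for illustration.

```python
import math


def active_blocks_per_mp(shared_per_block, shared_per_mp, threads_per_block,
                         regs_per_thread, regs_per_mp,
                         max_blocks_per_mp, max_threads_per_mp):
    """Assumed form of B_a: the tightest of the per-multiprocessor limits.

    Maps to the symbols in the quote as S_B, Z, T_r, R_T, R, B_max, T_maxMP.
    """
    limits = [
        max_blocks_per_mp,                        # B_max
        max_threads_per_mp // threads_per_block,  # T_maxMP / T_r
    ]
    if shared_per_block > 0:
        limits.append(shared_per_mp // shared_per_block)  # Z / S_B
    if regs_per_thread > 0:
        # Registers needed by one block are R_T * T_r.
        limits.append(regs_per_mp // (regs_per_thread * threads_per_block))
    return min(limits)


def passes_needed(blocks_requested, active_blocks, num_multiprocessors):
    """Passes required when B_r exceeds the blocks resident at once.

    num_multiprocessors stands in for P/Q (an assumption about the notation).
    """
    concurrent = active_blocks * num_multiprocessors
    if concurrent <= 0:
        raise ValueError("kernel configuration does not fit on a multiprocessor")
    return math.ceil(blocks_requested / concurrent)


# Hypothetical device and kernel parameters, chosen only for illustration.
b_a = active_blocks_per_mp(shared_per_block=4096, shared_per_mp=49152,
                           threads_per_block=256, regs_per_thread=32,
                           regs_per_mp=32768, max_blocks_per_mp=8,
                           max_threads_per_mp=1536)
passes = passes_needed(blocks_requested=100, active_blocks=b_a,
                       num_multiprocessors=14)
```

With these illustrative numbers the register limit is the binding constraint, so the requested blocks cannot all be resident at once and more than one pass is needed.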

“…In the present work, we utilize both asymptotic analysis and calibrated performance prediction on many-core GPUs, in effect drawing on the concepts of [10] and [19]. We develop an integrated analytical framework combining both, analyzing algorithm efficiency and predicting the achievable execution time based on a quantification of parallelism, latency-hiding, and occupancy.…”

Section: Introduction (mentioning)

confidence: 99%

“…These four metrics help to identify performance bottlenecks and suggest what types of optimizations should be done. Ma et al [2], [19] design an analytic model for memory-limited kernels especially as impacted by cache and various configuration parameters that can be used to tune kernel execution. This model also reflects warp scheduling effect at the thread block level rather than instruction level, thus it is simpler than the model in [18].…”

Section: Introduction (mentioning)

confidence: 99%


Performance modeling for highly-threaded many-core GPUs

Chamberlain

Agrawal

2014

2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors

Self Cite


Highly-threaded many-core GPUs can provide high throughput for a wide range of algorithms and applications. Such machines hide memory latencies via the use of a large number of threads and large memory bandwidth. The achieved performance, therefore, depends on the parallelism exploited by the algorithm, the effectiveness of latency hiding, and the utilization of multiprocessors (occupancy). In this paper, we extend previously proposed analytical models, jointly addressing parallelism, latency-hiding, and occupancy. In particular, the model not only helps to explore and reduce the configuration space for tuning kernel execution on GPUs, but also reflects performance bottlenecks and predicts how the runtime will trend as the problem and other parameters scale. The model is validated with empirical experiments. In addition, the model points to at least one circumstance in which the occupancy decisions automatically made by the scheduler are clearly sub-optimal in terms of runtime.


“…A number of high-performance GPU algorithms have been developed, such as sorting [1], hashing [2], dynamic programming [3], graph algorithms [4], and other classic algorithms [5]. Many performance studies have also been conducted [6], [7] to understand the performance of GPU applications.…”

Section: Introduction (mentioning)

confidence: 99%

Analysis of classic algorithms on GPUs

Chamberlain

Agrawal

2014

2014 International Conference on High Performance Computing & Simulation (HPCS)

Self Cite


The recently developed Threaded Many-core Memory (TMM) model provides a framework for analyzing algorithms for highly-threaded many-core machines such as GPUs. In particular, it tries to capture the fact that these machines hide memory latencies via the use of a large number of threads and large memory bandwidth. The TMM model analysis contains two components: computational complexity and memory complexity. A model is only useful if it can explain and predict empirical data. In this work, we investigate the effectiveness of the TMM model. We analyze algorithms for 5 classic problems (suffix tree/array for string matching, fast Fourier transform, merge sort, list ranking, and all-pairs shortest paths) under this model, and compare the results of the analysis with the experimental findings of ours and other researchers who have implemented and measured the performance of these algorithms on a spectrum of diverse GPUs. We find that the TMM model is able to predict important and sometimes previously unexplained trends and artifacts in the experimental data.
