Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008
DOI: 10.1145/1345206.1345220

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Cited by 663 publications (407 citation statements)
References 13 publications
“…Because many programs contain loops, we perform a value analysis to determine loop bounds (if possible). The value analysis is also used to analyze memory access patterns, which have a significant impact on performance on GPUs [26].…”
Section: Static Code Feature Extraction
confidence: 99%
“…CUDA-enabled GPU architecture is memory-bound architecture, so reasonable data layout on CUDA and memory optimization is critical for performance improvement [12,11].…”
Section: Optimization
confidence: 99%
“…GPUs are tuned for data parallelism, implementing the SIMD (Single Instruction -Multiple Data) processing model, allowing the execution of thousands of threads in parallel. GPUs have proven to be extremely efficient with matrix-style computations [9], providing a convincing speed-up of 2-3 orders of magnitude.…”
Section: Performance Considerations
confidence: 99%