High-performance code generation for stencil computations on GPU architectures

Holewinski, Justin; Pouchet, Louis-Noël; Sadayappan, P.

doi:10.1145/2304576.2304619

Cited by 207 publications

(158 citation statements)

References 16 publications

(38 reference statements)

Supporting

Mentioning

156

Contrasting

Unclassified

“…Even though mappings alternative to row- or column-major can produce a more favorable memory access stream for some applications without CMS [27,55,32,67], the baseline still cannot outperform CMS even with complex data layout transformations. CMS will also benefit communication-avoiding optimizations which trade off reduced memory traffic for redundant computation [51,28,18]. CMS can increase performance without the extra local storage, cache space, or computations needed for redundant communication, while better alleviating network congestion and reducing memory power.…”

Section: Discussion (mentioning)

confidence: 99%

“…However, processors may block and wait for others to become ready. Because barrier calls are typical in computation loops [28], synchronous reads introduce no additional waiting and can replace barrier calls.…”

Section: Read Operations (mentioning)

confidence: 99%

“…Past work has repeatedly reported that a wide variety of applications are constrained by memory bandwidth [65,28,64,59,36,18,46,12,32,55]. In those cases, while local and last-level caches can eliminate DRAM accesses during the computation phase of a loop, data is still retrieved from main memory when loading new and storing old HTAs, which is the focus of CMS.…”

Section: Related Work (mentioning)

confidence: 99%

“…Last-level caches can partially reconstruct address order for writes with a write-back policy. However, streaming (write-through) writes are preferable to write-back policies in stencil-based computations to avoid polluting higher-level caches because the results of a computation loop are not reused in the next iteration [18,28]. Even with a write-back policy, caches are constrained by their size and the unpredictability of the incoming packets, similar to a memory controller.…”

Section: Related Work (mentioning)

confidence: 99%

“…Memory bandwidth is not scaling rapidly enough to satisfy the increasing number of processors, making the performance of a wide variety of applications constrained by memory bandwidth [70,66,59,12,32,18,19,28,30,2,55]. In fact, current projections state that chip pins increase by 10% every year whereas on-chip processors double every 18 months [59].…”

Section: Introduction (mentioning)

confidence: 99%

Collective Memory Transfers for Multi-Core Chips

Williams¹,

Shalf²

2013

PSkel: A stencil programming framework for CPU‐GPU systems

Pereira

Ramos

Góes

2015

Concurrency and Computation

The use of Graphics Processing Units (GPUs) for high-performance computing has gained growing momentum in recent years. Unfortunately, GPU-programming platforms like Compute Unified Device Architecture (CUDA) are complex, user unfriendly, and increase the complexity of developing high-performance parallel applications. In addition, runtime systems that execute those applications often fail to fully utilize the parallelism of modern CPU-GPU systems. Typically, parallel kernels run entirely on the most powerful device available, leaving other devices idle. These observations sparked research in two directions: (1) high-level approaches to software development for GPUs, which strike a balance between performance and ease of programming; and (2) task partitioning to fully utilize the available devices. In this paper, we propose a framework, called PSkel, that provides a single high-level abstraction for stencil programming on heterogeneous CPU-GPU systems, while allowing the programmer to partition and assign data and computation to both CPU and GPU. Our current implementation uses parallel skeletons to transparently leverage Intel Threading Building Blocks (Intel Corporation, Santa Clara, CA, USA) and NVIDIA CUDA (Nvidia Corporation, Santa Clara, CA, USA). In our experiments, we observed that parallel applications with task partitioning can improve average performance by up to 76% and 28% compared with CPU-only and GPU-only parallel applications, respectively.

A common approach to address the CPU-GPU programming complexity is the use of algorithmic skeletons. Parallel skeletons model and abstract common parallel programming patterns (computation and coordination phases), thereby enabling the programmer to focus on algorithm design, rather than on runtime system details. Among existing parallel skeletons, the stencil pattern is critical in many scientific computing domains, including image and signal processing and computational fluid dynamics [3,4]. The large body of recent work targeting GPU implementations of high-performance stencil computations stresses the importance of that pattern [5-8].

Another important aspect of CPU-GPU platforms is that their runtime systems generally fail to exploit the platform's full potential for parallel processing. Specifically, the runtime systems do not partition the work (computations and data) of parallel applications across CPUs and GPUs to increase their utilization. For that reason, many existing frameworks have runtime systems that enable either static or dynamic task partitioning [5,9-13]. However, those frameworks either fail to provide high-level abstractions, support only multi-GPU systems, or do not partition tasks to both CPU and GPU simultaneously. The aforementioned observations call for systems that can both exploit task partitioning efficiently and provide high-level abstractions for CPU-GPU programming.

In this paper, we propose and evaluate PSkel (Parallel Skeletons), a framework for stencil programming in heterogeneous CPU-GPU systems. PSkel ...
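As a rough illustration of the static partitioning idea described in this abstract, the sketch below splits the interior rows of a 2D grid between two workers and applies one 5-point Jacobi step. It is not PSkel's API: the function names and the `gpu_fraction` parameter are hypothetical, and both partitions run on the CPU in plain Python purely to show how the rows are divided and why each partition needs to read one halo row from its neighbour.

```python
def jacobi_rows(grid, row_start, row_end):
    """Apply one 5-point Jacobi step to rows in [row_start, row_end).

    Reads rows row_start-1 and row_end from `grid` as halo rows.
    Returns a dict mapping row index to the updated row.
    """
    width = len(grid[0])
    out = {}
    for i in range(row_start, row_end):
        row = list(grid[i])  # boundary columns are copied unchanged
        for j in range(1, width - 1):
            row[j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                             + grid[i][j - 1] + grid[i][j + 1])
        out[i] = row
    return out


def partitioned_step(grid, gpu_fraction):
    """Split the interior rows between two workers (hypothetical scheme).

    gpu_fraction in [0, 1] is the share of interior rows given to the
    first worker; the second worker takes the remainder.
    """
    height = len(grid)
    interior = height - 2
    split = 1 + round(interior * gpu_fraction)
    result = [list(r) for r in grid]  # boundary rows stay as they are
    for start, end in ((1, split), (split, height - 1)):
        for i, row in jacobi_rows(grid, start, end).items():
            result[i] = row
    return result
```

Because every partition reads from the same input grid and writes disjoint rows, the result does not depend on where the split falls; a real CPU-GPU runtime would additionally have to copy the halo rows between device memories.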

Demystifying the 16 × 16 thread‐block for stencils on the GPU

Tabik

Peemen

Guil

et al. 2015

Concurrency and Computation

SUMMARY

Stencil computation is of paramount importance in many fields, such as image processing, structural biology and biomedicine. There is a permanent demand to maximize the performance of stencils on state-of-the-art architectures, such as graphics processing units (GPUs). One of the important issues when optimizing these kernels for the GPU is the selection of the best thread-block that maximizes the overall performance. Usually, programmers look for the optimal thread-block configuration in a reduced space of square thread-block configurations or simply use the best configurations reported in previous works, which is usually 16 × 16. This paper provides a better understanding of the impact of thread-block configurations on the performance of stencils on the GPU. In particular, we model locality and parallelism and consider that the optimal configurations are within the space that provides: (1) a small number of global memory communications; (2) a good shared memory utilization with small numbers of conflicts; (3) a good streaming multi-processors utilization; and (4) a high efficiency of the threads within a thread-block. The model determines the set of optimal thread-block configurations without the need of executing the code. We validate the proposed model using six stencils with different halo widths and show that it reduces the optimization space to around 25% of the total valid space. The configurations in this space achieve at least a throughput of 75% of ...
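To make the idea of pruning thread-block shapes without running the kernel more concrete, here is a minimal sketch. It is not the model from the paper, whose details are not given in this abstract. It assumes only two simple filters: the block must fill whole warps (32 threads, with at most 1024 threads per block, as on NVIDIA GPUs of compute capability 2.0 and later), and the ratio of elements loaded (tile plus halo) to elements computed must stay under a chosen threshold. The names `halo_overhead`, `candidate_blocks` and `max_overhead` are illustrative.

```python
WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
MAX_THREADS = 1024  # per-block thread limit assumed here


def halo_overhead(bx, by, halo):
    """Elements loaded per element computed for a bx-by-by tile with a halo."""
    return (bx + 2 * halo) * (by + 2 * halo) / (bx * by)


def candidate_blocks(halo, max_overhead):
    """Yield (bx, by, overhead) for power-of-two shapes that pass both filters."""
    sizes = [2 ** k for k in range(11)]  # 1, 2, 4, ..., 1024
    for bx in sizes:
        for by in sizes:
            threads = bx * by
            # Skip shapes that exceed the block limit or leave a partial warp.
            if threads > MAX_THREADS or threads % WARP_SIZE:
                continue
            overhead = halo_overhead(bx, by, halo)
            if overhead <= max_overhead:
                yield bx, by, overhead
```

For example, `list(candidate_blocks(halo=1, max_overhead=1.3))` keeps 16 × 16 but drops 32 × 8, since the elongated tile loads proportionally more halo elements. This proxy ignores shared-memory bank conflicts and multiprocessor occupancy, which the abstract lists as separate criteria.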
