Cache oblivious parallelograms in iterative stencil computations

Strzodka, Robert; Shaheen, Mohammed; Pająk, Dawid; Seidel, Hans‐Peter

doi:10.1145/1810085.1810096

Cited by 45 publications (37 citation statements)

References 18 publications (18 reference statements)

“…Strzodka et al [17] used time skewing and cache-size oblivious parallelograms to improve the memory system pressure and parallelism in stencils on CPUs. Micikevicius et al [15] handtuned a 3-D finite difference computation stencil and achieved an order of magnitude performance increase over existing CPU implementations on GT200-based Tesla GPUs.…”

Section: Related Work (mentioning)

confidence: 99%

A stencil compiler for short-vector SIMD architectures

Henretty, Veras, Franchetti, et al. 2013

Proceedings of the 27th International ACM Conference on International Conference on Supercomputing

110

Stencil computations are an integral component of applications in a number of scientific computing domains. Short-vector SIMD instruction sets are ubiquitous on modern processors and can be used to significantly increase the performance of stencil computations. Traditional approaches to optimizing stencils on these platforms have focused on either short-vector SIMD or data locality optimizations. In this paper, we propose a domain-specific language and compiler for stencil computations that allows specification of stencils in a concise manner and automates both locality and short-vector SIMD optimizations, along with effective utilization of multi-core parallelism. Loop transformations to enhance data locality and enable load-balanced parallelism are combined with a data layout transformation to effectively increase the performance of stencil computations. Performance increases are demonstrated for a number of stencils on several modern SIMD architectures.

“…A number of recent studies have focused on optimizing stencil computations for multicore CPUs and GPUs [2,6,8,12,17,19,20]. Strzodka et al [17] use time skewing and cache-size oblivious parallelograms to improve the memory system pressure and parallelism in stencils on CPUs.…”

Section: Related Work (mentioning)

confidence: 99%
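
The time skewing mentioned in the statement above can be illustrated with a minimal sketch. It assumes a 1-D three-point averaging stencil with fixed end cells and a fixed tile width `block`; the function names are hypothetical, and the sketch does not reproduce the recursive, cache-size oblivious decomposition of the cited paper. Each tile is a parallelogram in (t, x) space that shifts one cell to the left per time step, so every value a tile reads has already been produced by the same tile or by a tile to its left.

```python
def jacobi_naive(u, steps):
    """Reference: `steps` full sweeps of a 3-point average, end cells fixed."""
    cur = list(u)
    for _ in range(steps):
        nxt = list(cur)
        for x in range(1, len(cur) - 1):
            nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_time_skewed(u, steps, block):
    """Same result, computed tile by tile over skewed (parallelogram) tiles."""
    n = len(u)
    # Two ping-pong buffers: buf[t % 2] holds the values of time level t.
    buf = [list(u), list(u)]
    # Enough tiles so that the skewed tiles still cover x = n - 2 at the last level.
    num_tiles = (n + steps + block - 1) // block
    for k in range(num_tiles):
        for t in range(steps):
            # At level t + 1, tile k covers x in [k*block - (t+1), (k+1)*block - (t+1)),
            # clipped to the interior cells 1 .. n - 2.
            lo = max(1, k * block - (t + 1))
            hi = min(n - 1, (k + 1) * block - (t + 1))
            src, dst = buf[t % 2], buf[(t + 1) % 2]
            for x in range(lo, hi):
                dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0
    return buf[steps % 2]
```

Calling `jacobi_time_skewed(u, steps, block)` should return the same values as `jacobi_naive(u, steps)` for any positive `block`. The potential benefit is that a tile of width `block` is advanced through all time steps while its data is still in cache, instead of streaming the whole array once per step.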

“…A number of recent studies have focused on optimizing stencil computations on multicore CPUs [2,6,8,17,19] as well as GPUs [11][12][13].…”

Section: Introduction (mentioning)

confidence: 99%

High-performance code generation for stencil computations on GPU architectures

Holewinski, Pouchet, Sadayappan 2012

Proceedings of the 26th ACM International Conference on Supercomputing

212

156

Stencil computations arise in many scientific computing domains, and often represent time-critical portions of applications. There is significant interest in offloading these computations to high-performance devices such as GPU accelerators, but these architectures offer challenges for developers and compilers alike. Stencil computations in particular require careful attention to off-chip memory access and the balancing of work among compute units in GPU devices. In this paper, we present a code generation scheme for stencil computations on GPU accelerators, which optimizes the code by trading an increase in the computational workload for a decrease in the required global memory bandwidth. We develop compiler algorithms for automatic generation of efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation. We show that the code generation scheme can achieve high performance on a range of GPU architectures, including both nVidia and AMD devices.
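
The trade-off this abstract describes, more computation in exchange for less memory traffic, can be illustrated with a minimal overlapped-tiling sketch. It is written in plain Python for a 1-D three-point averaging stencil with fixed end cells; the function names and the tile layout are assumptions made for illustration and are not the code generator from the paper. Each tile loads its own cells plus a halo of `steps` cells per side once, advances that local copy through all time steps, and keeps only the cells it owns. The halo cells are recomputed by neighbouring tiles, which is the extra work.

```python
def jacobi_step(u):
    """One 3-point averaging sweep over `u`; the two end cells stay fixed."""
    if len(u) < 3:
        return list(u)
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def naive(u, steps):
    """Reference: sweep the whole array once per time step."""
    cur = list(u)
    for _ in range(steps):
        cur = jacobi_step(cur)
    return cur


def overlapped_tiles(u, steps, tile):
    """Same result, with one load of the input per tile instead of one per step."""
    n = len(u)
    out = list(u)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        # Halo of `steps` cells on each side, clipped at the domain boundary.
        lo = max(0, start - steps)
        hi = min(n, end + steps)
        local = u[lo:hi]
        # Stale values creep inward one cell per step from any artificial edge,
        # so after `steps` sweeps only the halo is invalid.
        for _ in range(steps):
            local = jacobi_step(local)
        out[start:end] = local[start - lo:end - lo]
    return out
```

Under these assumptions `overlapped_tiles(u, steps, tile)` should match `naive(u, steps)`. Because tiles never exchange data, they could be processed independently, at the cost of roughly `2 * steps` redundant cells per tile.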

“…Song and Li proposed similar techniques in [4]. [3] proposed a cache-oblivious algorithm to compute stencil computations for the same purpose of temporal reuse. [6] proposed circular queue, which uses a special data structure to hold temporary values.…”

Section: Related Work (mentioning)

confidence: 99%

“…Otherwise, memory bandwidth may become a serious performance bottleneck. Thus, a lot of prior work exploits this kind of reuse across multiple time steps [1][2][3][4][5].…”

Section: Introduction (mentioning)

confidence: 99%

A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs

Yang, Cui, Feng, et al. 2012

J. Comput. Sci. Technol.

In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPU by carefully balancing usage of registers and shared-memory. Unlike earlier methods that rely on circular queues predominantly implemented using indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning across the multiple time steps in stencil computations so that circular queues can be implemented by both shared memory and registers effectively in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups up to 2.93X over methods that use circular queues implemented with shared-memory only.
