High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

Zohouri, Hamid Reza; Podobas, Artur; Matsuoka, Satoshi

doi:10.1109/ipdpsw.2018.00027

Cited by 17 publications

(17 citation statements)

References 20 publications


“…Similarly, 1.5D spatial blocking can be used for 2D stencils. This blocking technique has been widely employed on different devices [17,23,40,41], with not just two, but also more combined time-steps. In 2.5D spatial blocking, the 2D tiles are streamed over one dimension and data of each tile is effectively reused for updating the next tiles.…”

Section: Temporal Blocking (mentioning)

confidence: 99%
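
The streaming idea in the statement above can be illustrated with a minimal sketch of 1.5D spatial blocking for a 2D stencil. This is not code from either paper: the 5-point average, the function names, and the `block_width` parameter are assumptions chosen for illustration, and the grid is assumed to be at least 3 by 3. The x dimension is split into blocks, and each block is streamed over y with a three-row sliding window, so every row loaded for a block is reused for three consecutive output rows.

```python
def stencil_reference(grid):
    """One step of a 5-point average on interior cells; boundary cells are copied."""
    ny, nx = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            out[y][x] = (grid[y][x] + grid[y - 1][x] + grid[y + 1][x]
                         + grid[y][x - 1] + grid[y][x + 1]) / 5.0
    return out


def stencil_blocked_1p5d(grid, block_width):
    """Same update, with x split into blocks and each block streamed over y.

    block_width (hypothetical parameter) is the number of output columns
    per block; each block also reads one halo column on either side.
    """
    ny, nx = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    radius = 1  # stencil radius of the 5-point pattern
    for x0 in range(radius, nx - radius, block_width):
        x1 = min(x0 + block_width, nx - radius)
        lo, hi = x0 - radius, x1 + radius  # block widened by the halo
        # Sliding window holding the rows above and at the current y.
        window = [grid[0][lo:hi], grid[1][lo:hi]]
        for y in range(1, ny - 1):
            window.append(grid[y + 1][lo:hi])  # stream in the next row
            above, centre, below = window
            for x in range(x0, x1):
                i = x - lo  # position inside the widened block
                out[y][x] = (centre[i] + above[i] + below[i]
                             + centre[i - 1] + centre[i + 1]) / 5.0
            window.pop(0)  # the oldest row is no longer needed
    return out
```

Both functions add the five terms in the same order, so the blocked version reproduces the reference result exactly for any positive `block_width`. A 2.5D version for 3D stencils would follow the same pattern with 2D tiles streamed over the third dimension.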

“…LIFT [11] is a functional data-parallel programming language that allows expressing stencil loops as a set of reusable parallel primitives and optimizing them. Recently, multiple implementations of N.5D blocking on FPGAs have also been proposed with FPGA-specific optimizations [4,5,40,41]. FPGAs tend to achieve better scaling with temporal blocking compared to GPUs due to higher flexibility of employing their on-chip memory which allows larger spatial block sizes compared to GPUs.…”

Section: Related Work (mentioning)

confidence: 99%


AN5D: automated stencil framework for high-degree temporal blocking on GPUs

Kazuaki, Zohouri, Wahib, et al. 2020

Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization

Self Cite


Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.

CCS Concepts: Software and its engineering → Source code generation.


“…A halo size of zero results in simple sequential reading/writing with no overlapping. The second and third classes implement 1.5D and 2.5D overlapped spatial blocking, respectively, that are widely used in 2D and 3D stencil computation [5,6,7,8,9,10]. For the 1.5D class, the x dimension is blocked and memory accesses are streamed row by row until the last index in the y dimension, before moving to the next block ( Fig.…”

Section: A Memory Benchmark Suite (mentioning)

confidence: 99%
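
The traversal order described in the statement above can be sketched as a generator of read coordinates. This is an illustrative reading of the description, not the benchmark's code: the function name and parameters are hypothetical, and the sketch assumes that `block_width` includes the halo columns, so consecutive blocks overlap by `2 * halo` columns.

```python
def overlapped_1p5d_reads(nx, ny, block_width, halo):
    """Yield (y, x) read coordinates for 1.5D overlapped blocking.

    Assumption: block_width counts the halo columns, so consecutive blocks
    start block_width - 2 * halo columns apart and overlap by 2 * halo.
    """
    stride = block_width - 2 * halo
    if stride <= 0:
        raise ValueError("block_width must exceed 2 * halo")
    if nx <= 2 * halo:
        raise ValueError("nx must exceed 2 * halo")
    for start in range(0, nx - 2 * halo, stride):
        end = min(start + block_width, nx)
        # Stream the block row by row down to the last y index...
        for y in range(ny):
            for x in range(start, end):
                yield (y, x)
        # ...and only then move on to the next block in x.
```

With `halo=0` and a single block spanning the row, the order reduces to plain sequential access. With a nonzero halo, the columns shared by neighbouring blocks are read twice, which is the redundancy that overlapped blocking trades for independent blocks.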

“…The 1.5D and 2.5D blocking classes support all the above array configurations except R1W0. All Single Work-item kernels use collapsed loops with the exit condition optimization from [4,5,6] for best timing and hence, are constructed as a doubly-nested loops, with the fully-unrolled innermost loop having a trip count equal to the vector size, and the outer loop having an initiation interval of one. In the NDRange kernels, the workgroups have the same number of dimensions as the input, and memory access coalescing is performed using loop unrolling.…”

Section: Out Of Bound (mentioning)

confidence: 99%
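
Loop collapsing with a simplified exit condition, as mentioned in the statement above, can be sketched as follows. This is a hedged interpretation rather than the cited kernels: the function name is hypothetical, the fully unrolled vector loop is omitted, and Python only shows the index logic, not the timing benefit on an FPGA. The nested loop headers are replaced by one loop whose exit test compares a single global counter against a trip count computed once, while the original indices are updated by hand.

```python
def collapsed_indices(ny, nx):
    """Visit every (y, x) of an ny-by-nx loop nest with one loop and one exit test."""
    total = ny * nx  # trip count, computed once before the loop
    visited = []
    y = x = 0
    step = 0  # global counter used only for the exit condition
    while step < total:
        visited.append((y, x))
        # Manual index update replaces the inner and outer loop headers.
        x += 1
        if x == nx:
            x = 0
            y += 1
        step += 1
    return visited
```

The visiting order is identical to the original nest, `for y in range(ny): for x in range(nx)`, so the loop body needs no changes when the nest is collapsed this way.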

“…It has been shown that due to the low external memory bandwidth and low byte to FLOP ratio of modern FPGAs, it is frequently not possible to fully utilize the compute performance of these devices for typical applications [1,4]. Moreover, even this limited external memory bandwidth cannot always be efficiently utilized due to limitations in the memory controller [4,5,6]. In this work, we will focus on analyzing the efficiency of the external memory controller on Intel FPGAs for OpenCL-based designs under different configurations.…”

Section: Introduction (mentioning)

confidence: 99%


The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface

Zohouri, Matsuoka 2019

2019 IEEE/ACM International Workshop on Heterogeneous High-Performance Reconfigurable Computing (H2RC)

Self Cite


Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency of the memory controller of FPGAs is missing in literature, which becomes even more crucial when the limited memory bandwidth of modern FPGAs compared to their GPU counterparts is taken into account. In this work, we will analyze the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of overlapped blocking. Our results point to multiple shortcomings in the memory controller of Intel FPGAs, especially with respect to memory access alignment, that can hinder the programmer's ability in maximizing memory performance in their design. For some of these cases, we will provide work-arounds to improve memory bandwidth efficiency; however, a general solution will require major changes in the memory controller itself.
