Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

Zohouri, Hamid Reza; Podobas, Artur; Matsuoka, Satoshi

doi:10.1145/3174243.3174248

Cited by 71 publications

(74 citation statements)

References 18 publications

“…• Loop collapsing to reduce area overhead of storing variable and buffer states in multiply-nested loops • Padding relative to the degree of temporal parallelism to reduce unaligned accesses caused by overlapped blocking that result in memory bandwidth waste Complete details of our implementation and the performance model we use for parameter tuning are discussed in [8].…”

Section: A Base Implementation For First-order Stencils (mentioning)

confidence: 99%

“…To extend our base implementation from [8] for high-order stencil computation, multiple modifications were required:…”

Section: B Extension For High-order Stencils (mentioning)

confidence: 99%

“…Since stencil radius had already been considered in our performance model in [8], no further changes were required in the model to support high-order stencils.…”

Section: B Extension For High-order Stencils (mentioning)

confidence: 99%

High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

Zohouri

Podobas

Matsuoka

2018

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Self Cite

In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, compute performance compared to first-order stencils. We use an OpenCL-based design that, apart from parameterizing performance knobs, also parameterizes the stencil radius. Furthermore, we show that our performance model exhibits the same accuracy as first-order stencils in predicting the performance of high-order ones. On an Intel Arria 10 GX 1150 device, for 2D and 3D star-shaped stencils, we achieve over 700 and 270 GFLOP/s of compute performance, respectively, up to a stencil radius of four. These results outperform the state-of-the-art YASK framework on a modern Xeon for 2D and 3D stencils, and outperform a modern Xeon Phi for 2D stencils, while achieving competitive performance in 3D. Furthermore, our FPGA design achieves better power efficiency in almost all cases.

“…Recently, a stencil accelerator using Intel OpenCL was presented in the work of Zohouri et al, where the authors highlight two important concerns about the compiler. First, due to the Partial Reconfiguration on Arria 10 FPGAs, the fitting and routing quality for OpenCL is reduced on Arria 10.…”

Section: Related Work (mentioning)

confidence: 99%

“…One way is to focus on application and domain‐specific accelerators: neural networks, Bayesian learning, bioinformatics, stencil computing, energy‐efficient accelerators for graph analytics algorithms, and irregular applications mapping. Another way is to focus on Domain‐Specific Languages (DSLs), which aim to represent parallelism in stream‐based applications, like SPar based on C++ …”

Section: Related Work (mentioning)

confidence: 99%

ADD: Accelerator Design and Deploy ‐ A tool for FPGA high‐performance dataflow computing

Penha

Silva

et al. 2018

Concurrency and Computation

Summary: Dataflow‐based FPGA accelerators have become a promising alternative to deliver energy‐efficient high‐performance computing. However, FPGA programming is still a challenge. This paper presents Accelerator Design and Deploy (ADD), a high‐level framework to specify, to simulate, and to implement dataflow accelerators for streaming applications. The framework includes an open dataflow operator library, and templates are provided to easily design new operators. The framework also provides a high‐level and an accurate simulation at circuit level with short execution times. Moreover, ADD provides software and hardware APIs to simplify the integration process, extending the benefits of portability from low‐cost FPGA boards to high performance datacenter FPGA platforms. Our framework supports coupling with high‐level programming languages, and it has been validated on two FPGA platforms: the Intel high‐performance CPU‐FPGA heterogeneous computing platform and an educational FPGA kit. We show that our simple approach presents competitive performance, both in time and energy, when compared to multi‐core and GPU accelerators.
