Flexible FPGA design for FDTD using OpenCL

Kenter, Tobias; Förstner, Jens; Plessl, Christian

doi:10.23919/fpl.2017.8056844

Cited by 19 publications

(14 citation statements)

References 7 publications


“…We use shift registers as on-chip buffers to take advantage of the regular memory access pattern in stencil computation. This is a well-known optimization that is employed in many deep-pipeline [9,20,22]. This optimization is not applicable to CPUs and GPUs due to lack of hardware support for this storage type.…”

Section: Spatial Blocking On FPGAs (mentioning)

confidence: 99%
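The shift-register buffering described in the statement above can be illustrated with a small software model. This is only a sketch under stated assumptions, not code from the cited papers: a Python `deque` stands in for the on-chip shift register, the stencil is assumed to be a 5-point average on a row-major 2D grid, and the function name `stencil_shift_register` is hypothetical. On an FPGA the same idea would typically be written as a fixed-size array that is shifted once per loop iteration inside an OpenCL kernel.

```python
from collections import deque


def stencil_shift_register(grid, width, height):
    """One sweep of a 5-point average over a row-major flat list.

    Each input value is read exactly once. A window of 2*width + 1 values
    (the modeled shift register) holds every neighbor the stencil needs.
    Boundary cells are copied through unchanged (an assumption of this sketch).
    """
    size = 2 * width + 1
    window = deque([0.0] * size, maxlen=size)  # modeled shift register
    total = width * height
    out = [0.0] * total

    # Stream the grid, then `width` padding values to flush the last cells.
    for i in range(total + width):
        window.append(grid[i] if i < total else 0.0)  # shift in one value
        c = i - width  # flat index of the cell now at the window center
        if c < 0:
            continue
        row, col = divmod(c, width)
        if 0 < row < height - 1 and 0 < col < width - 1:
            north = window[0]
            west = window[width - 1]
            center = window[width]
            east = window[width + 1]
            south = window[2 * width]
            out[c] = (north + west + center + east + south) / 5.0
        else:
            out[c] = window[width]
    return out


if __name__ == "__main__":
    w, h = 4, 4
    data = [float(i * i) for i in range(w * h)]
    print(stencil_shift_register(data, w, h))
```

The window size depends only on the row width, not on the number of rows, which is why this buffering suits a streaming pipeline with a regular access pattern.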

“…We achieve this large performance advantage despite the fact that the Kintex-7 XC7Z045 FPGA they use has more DSPs and roughly half of the logic and Block RAM count of our Stratix V A7 FPGA. [1,9,20,22] present the recent high-performing deep-pipelined implementations of stencil computation on FPGAs, all of which avoid spatial blocking and hence, put hard limits on input dimensions relative to on-chip memory size. In contrast, we do employ spatial blocking to avoid such restrictions which limit usability in real-world HPC applications, and show that it is still possible to achieve high performance.…”

Section: Related Work (mentioning)

confidence: 99%

“…Previous work [1,9,20,22] have shown that FPGAs can achieve GPU-level performance in stencil computation. Most of such work achieve this level of performance by relying on temporal blocking without spatial blocking.…”

Section: Introduction (mentioning)

confidence: 99%


Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

Zohouri

Podobas

Matsuoka

2018

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays


Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieve this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without having such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arisen from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance, for 2D and 3D stencils, respectively, which rivals the performance of a highly-optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.

CCS Concepts: Hardware → Reconfigurable logic and FPGAs; High-level and register-transfer level synthesis


“…We also employ temporal blocking to take advantage of the temporal locality of stencil computation by storing intermediate results of multiple iterations (time steps) on-chip, before finally writing them back to external memory. Unlike many previous studies on FPGAs [14][15][16][17], combining spatial and temporal blocking allows us to achieve high performance without restricting input size.…”

Section: A Base Implementation For First-order Stencils (mentioning)

confidence: 99%
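The temporal blocking idea in the statement above, keeping intermediate time steps on-chip and writing only the final result back, can be modeled as a chain of streaming stages. The following is a rough sketch under stated assumptions rather than the cited design: it uses a 1D 3-point average with fixed end cells, chains Python generators in place of hardware pipeline stages, and omits spatial blocking entirely. The names `stencil_stage` and `temporal_block` are hypothetical.

```python
from collections import deque
from itertools import chain


def stencil_stage(stream, n):
    """Apply one 3-point averaging sweep to a stream of n values.

    The first and last cells pass through unchanged. Only three values
    are buffered at any time (the modeled on-chip storage of this stage).
    """
    window = deque([0.0] * 3, maxlen=3)
    # One padding value flushes the final cell out of the window.
    for i, value in enumerate(chain(stream, [0.0])):
        window.append(value)
        c = i - 1  # index of the cell now at the window center
        if c < 0:
            continue
        if 0 < c < n - 1:
            yield (window[0] + window[1] + window[2]) / 3.0
        else:
            yield window[1]


def temporal_block(values, steps):
    """Chain `steps` stages so the input is read once and written once."""
    n = len(values)
    stream = iter(values)
    for _ in range(steps):
        stream = stencil_stage(stream, n)  # output of one stage feeds the next
    return list(stream)


if __name__ == "__main__":
    print(temporal_block([0.0, 0.0, 9.0, 0.0, 0.0], 2))
```

In this model the intermediate time step never exists as a full array: each stage consumes values as the previous stage produces them, which corresponds to avoiding round trips to external memory between time steps.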

High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

Zohouri

Podobas

Matsuoka

2018

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)


In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, compute performance compared to first-order stencils. We use an OpenCL-based design that, apart from parameterizing performance knobs, also parameterizes the stencil radius. Furthermore, we show that our performance model exhibits the same accuracy as first-order stencils in predicting the performance of high-order ones. On an Intel Arria 10 GX 1150 device, for 2D and 3D star-shaped stencils, we achieve over 700 and 270 GFLOP/s of compute performance, respectively, up to a stencil radius of four. These results outperform the state-of-the-art YASK framework on a modern Xeon for 2D and 3D stencils, and outperform a modern Xeon Phi for 2D stencils, while achieving competitive performance in 3D. Furthermore, our FPGA design achieves better power efficiency in almost all cases.


“…Examples of the latter are MD implementations on graphics processing units (GPUs) [Abraham et al 2015; Anderson et al 2008; Brown et al 2012; Colberg and Höfling 2011; Eastman and Pande 2010; Le Grand et al 2013; Stone et al 2010], field-programmable gate arrays (FPGAs) [Herbordt et al 2008a,b], and application-specific integrated circuits (ASICs) [Shaw et al 2007, 2014]. While the use of GPUs for scientific applications is relatively widespread [Owens et al 2008; Preis et al 2009; Weigel 2012], the use of ASICs [Boyle et al 2005; Brown and Christ 1988; Fukushige et al 1999; and FPGAs is less common [Baity-Jesi et al 2014; Belletti et al 2009; Giefers et al 2014; Kenter et al 2017, 2018; Meyer et al 2012], but gained attention over the last years. In general, to maximize the computational power for a given silicon area, or equivalently minimize the power-consumption per arithmetic operation, more and more computing units are replaced with lower-precision units.…”

Section: Introduction (mentioning)

confidence: 99%

Accurate Sampling with Noisy Forces from Approximate Computing

et al. 2020

Self Cite


In scientific computing, the acceleration of atomistic computer simulations by means of custom hardware is finding ever growing application. A major limitation, however, is that the high efficiency in terms of performance and low power consumption entails the massive usage of low-precision computing units. Here, based on the approximate computing paradigm, we present an algorithmic method to rigorously compensate for numerical inaccuracies due to low-accuracy arithmetic operations, yet still obtaining exact expectation values using a properly modified Langevin-type equation.

