Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of Selected Benchmarks of the HPCChallenge Benchmark Suite

Meyer, Marius; Kenter, Tobias; Plessl, Christian

doi:10.1109/h2rc51942.2020.00007

Cited by 19 publications

(11 citation statements)

References 10 publications


“…We also support this parameter in the newly added benchmarks. A more detailed description of the build process is given in our previous work [13] and in the online documentation. Different hardware interfaces can be utilized for inter-FPGA communication with recent FPGA boards.…”

Section: Parallel Implementation Of HPC Challenge Benchmarks For FPGA (mentioning)

confidence: 99%

“…The base implementation of the HPL benchmark uses a two-level blocked approach similar to that of the GEMM benchmark described in [13]. Thus, it uses two parameters to specify the block sizes of the local memory buffers and of the compute units as described in Table 4.…”

Section: Intel (mentioning)

confidence: 99%
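The two-level blocking mentioned in the statement above can be illustrated with a minimal sketch. It is not the cited implementation: the function name and the parameters `local_block` (standing in for the local memory buffer size) and `unit_block` (standing in for the compute unit size) are hypothetical, and plain Python lists replace on-chip buffers.

```python
def blocked_matmul(a, b, n, local_block, unit_block):
    """Multiply two n x n matrices (lists of lists) using two levels of blocking.

    local_block and unit_block are hypothetical names for the two block-size
    parameters; they are assumptions made for this illustration only.
    """
    assert n % local_block == 0 and local_block % unit_block == 0
    c = [[0.0] * n for _ in range(n)]
    # Outer level: tiles sized like a local memory buffer.
    for i0 in range(0, n, local_block):
        for j0 in range(0, n, local_block):
            for k0 in range(0, n, local_block):
                # Inner level: sub-tiles sized like a single compute unit.
                for i1 in range(i0, i0 + local_block, unit_block):
                    for j1 in range(j0, j0 + local_block, unit_block):
                        for k1 in range(k0, k0 + local_block, unit_block):
                            # Small dense multiply-accumulate on one sub-tile.
                            for i in range(i1, i1 + unit_block):
                                for j in range(j1, j1 + unit_block):
                                    acc = c[i][j]
                                    for k in range(k1, k1 + unit_block):
                                        acc += a[i][k] * b[k][j]
                                    c[i][j] = acc
    return c
```

In this sketch the outer block size would be bounded by the available local memory, while the inner block size would set how much work one compute unit performs per step; both values are tuning parameters rather than fixed constants.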

“…In addition to the new benchmarks proposed in this paper, we also extend the existing benchmarks of our previous work [13] for the execution in a multi-FPGA environment. An essential configuration parameter for all benchmarks is the specification of kernel replications NUM_REPLICATIONS.…”

Section: Extend Existing Benchmarks For (mentioning)

confidence: 99%

“…However, for the HPC area, benchmark suites with relevant benchmark applications that allow the evaluation of these systems are rare. To overcome this shortage, we earlier proposed HPCC FPGA [13] based on the HPC Challenge Benchmark suite [6] targeting the HPC domain. In this previous work, we focussed on the performance characterization of a single FPGA with regard to memory access patterns of the applications.…”

Section: Introduction (mentioning)

confidence: 99%


Multi-FPGA Designs and Scaling of HPC Challenge Benchmarks via MPI and Circuit-Switched Inter-FPGA Networks

Meyer, Kenter, Plessl

2022

Preprint

Self Cite


While FPGA accelerator boards and their respective high-level design tools are maturing, there is still a lack of multi-FPGA applications, libraries, and not least, benchmarks and reference implementations towards sustained HPC usage of these devices. As in the early days of GPUs in HPC, for workloads that can reasonably be decoupled into loosely coupled working sets, multi-accelerator support can be achieved by using standard communication interfaces like MPI on the host side. However, for performance and productivity, some applications can profit from a tighter coupling of the accelerators. FPGAs offer unique opportunities here when extending the dataflow characteristics to their communication interfaces. In this work, we extend the HPCC FPGA benchmark suite by multi-FPGA support and three missing benchmarks that particularly characterize or stress inter-device communication: b_eff, PTRANS, and LINPACK. With all benchmarks implemented for current boards with Intel and Xilinx FPGAs, we established a baseline for multi-FPGA performance. Additionally, for the communication-centric benchmarks, we explored the potential of direct FPGA-to-FPGA communication with a circuit-switched inter-FPGA network that is currently only available for one of the boards. The evaluation with parallel execution on up to 26 FPGA boards makes use of one of the largest academic FPGA installations.

“…As we are mimicking the code structure of previous GPU and CPU implementations this means that we need to execute 2 unaligned loads and stores. Related work [34] evaluating fully random access patterns with nonaligned loads and stores has shown a performance of around 60M transactions per second per DDR memory bank on the Stratix 10 architecture, corresponding to around 5 clock cycles per pair of read and write operations relative to the 300 MHz of the memory interface. In the gather-scatter operation for SEM, the pattern is not fully random, but also not strictly pairwise, as multiple reads have to be completed before the sums are written back to the respective locations.…”

Section: Maximizing Memory Bandwidth (mentioning)

confidence: 99%
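The figure of about 5 clock cycles quoted above follows from dividing the memory interface clock by the transaction rate. The short sketch below reproduces that arithmetic using only the two numbers from the statement; the function and constant names are illustrative assumptions, not taken from the cited work.

```python
def cycles_per_transaction(clock_hz: float, transactions_per_s: float) -> float:
    """Average number of clock cycles that elapse per completed transaction."""
    return clock_hz / transactions_per_s


# Values taken from the quoted statement; the names are hypothetical.
MEM_INTERFACE_CLOCK_HZ = 300e6   # 300 MHz memory interface
TRANSACTIONS_PER_S = 60e6        # about 60M read/write pairs per second per DDR bank

cycles = cycles_per_transaction(MEM_INTERFACE_CLOCK_HZ, TRANSACTIONS_PER_S)
print(f"{cycles:.1f} cycles per read/write pair")
```

Because the quoted transaction rate is approximate ("around 60M"), the resulting cycle count should be read as an estimate as well.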

A High-Fidelity Flow Solver for Unstructured Meshes on Field-Programmable Gate Arrays

Karp, Podobas, Kenter, et al. 2021

Preprint

Self Cite


The impending termination of Moore's law motivates the search for new forms of computing to continue the performance scaling we have grown accustomed to. Among the many emerging Post-Moore computing candidates, perhaps none is as salient as the Field-Programmable Gate Array (FPGA), which offers the means of specializing and customizing the hardware to the computation at hand. In this work, we design a custom FPGA-based accelerator for a computational fluid dynamics (CFD) code. Unlike prior work, which often focuses on accelerating small kernels, we target the entire unstructured Poisson solver based on the high-fidelity spectral element method (SEM) used in modern state-of-the-art CFD systems. We model our accelerator using an analytical performance model based on the I/O cost of the algorithm. We empirically evaluate our accelerator on a state-of-the-art Intel Stratix 10 FPGA in terms of performance and power consumption and contrast it against existing solutions on general-purpose processors (CPUs). Finally, we propose a novel data movement-reducing technique where we compute geometric factors on the fly, which yields significant (700+ GFlop/s) single-precision performance and a reduction in runtime of upwards of 2x for the local evaluation of the Laplace operator. We end the paper by discussing the challenges and opportunities of using reconfigurable architecture in the future, particularly in the light of emerging (not yet available) technologies.


FPGA‐based HPC accelerators: An evaluation on performance and energy efficiency

Nguyen, MacLean, Siracusa, et al. 2021

Concurrency and Computation


Hardware specialization is a promising direction for the future of digital computing. Reconfigurable technologies enable hardware specialization with modest non‐recurring engineering cost, but their performance and energy efficiency compared to state‐of‐the‐art processor architectures remain an open question. In this article, we use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications. In order to properly evaluate performance, we not only compare Intel Arria 10 and Xilinx U280 performance against Intel Xeon, Intel Xeon Phi, and NVIDIA V100 GPUs, but we also extend the Empirical Roofline Toolkit (ERT) to FPGAs in order to assess our results in terms of the Roofline model. We show design optimization and tuning techniques for peak FPGA performance at reasonable hardware usage and power consumption. As FPGA peak performance is known to be far less than that of a GPU, we also benchmark the energy efficiency of each platform for the scientific kernels comparing against microbenchmark and technological limits. Results show that while FPGAs struggle to compete in absolute terms with GPUs on memory‐ and compute‐intensive kernels, they require far less power and can deliver nearly the same energy efficiency.
