Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs

Macintosh, Hamish; Banks, Jasmine; Kelson, Neil A.

doi:10.1155/2019/3679839

Cited by 8 publications

(11 citation statements)

References 15 publications


“…With the introduction of High-Level synthesis (HLS) tools, a number of more recent works [14], [15], [16], [29] implemented the Thomas, PCR, and Spike algorithms on FPGA using HLS tools. Many of these works did not demonstrate the solver working on full applications, with the exception of László et al in 2015 [14] which compared a one factor Black-Scholes option pricing equation using explicit and implicit methods on different architectures such as multi core CPUs, GPUs, and FPGAs.…”

Section: Related Work (mentioning)

confidence: 99%

“…The FPGA performance with PCR is shown to be comparable to that of the GPU, but the Spike algorithm on the FPGA outperforms the GPU. Similarly Macintosh, et al in 2019 [15] uses OpenCL to develop oclspkt, a library that implements tridiagonal systems solvers targeting FPGAs, GPUs, and CPUs. oclspkt uses the truncated spike algorithm, and as such will not give exact solutions.…”

Section: Related Work (mentioning)

confidence: 99%

“…As such, our underlying goal is to understand the criteria for a given system solver to be amenable to FPGA implementation and uncover the limitations and profitability of such accelerators. Previous work on tridiagonal system solvers for FPGAs utilized both low-level hardware description languages [18], [28], [31] as well as high-level synthesis tools [3], [14], [15], [16], [29]. They demonstrated implementation of standard tridiagonal system solver algorithms (Thomas, PCR, and Spike), evaluating how to best utilize FPGA resources to maximize performance.…”

Section: Introduction (mentioning)

confidence: 99%


High Throughput Multidimensional Tridiagonal Systems Solvers on FPGAs

Kamalakkannan, Reguly, Fahmy et al. 2022

Preprint


This paper presents a design space exploration for synthesizing optimized, high-throughput implementations of multiple multi-dimensional tridiagonal system solvers on FPGAs. Re-evaluating the characteristics of algorithms for the direct solution of tridiagonal systems, we develop a new tridiagonal solver library aimed at implementing high-performance computing applications on Xilinx FPGA hardware. Key new features of the library are (1) the unification of standard state-of-the-art techniques for implementing implicit numerical solvers with a number of novel high-gain optimizations such as vectorization and batching, motivated by multiple multi-dimensional systems common in real-world applications, (2) data-flow techniques that provide application specific optimizations for both 2D and 3D problems, including integration of explicit loops commonplace in real workloads, and (3) the development of a predictive analytic model to explore the design space, and obtain rapid resource and performance estimates. The new library provides an order of magnitude better performance when solving large batches of systems compared to Xilinx's current tridiagonal solver library. Two representative applications are implemented using the new solver on a Xilinx Alveo U280 FPGA, demonstrating over 85% predictive model accuracy. These are compared with a current state-of-the-art GPU library for solving multi-dimensional tridiagonal systems on an Nvidia V100 GPU, analyzing time to solution, bandwidth, and energy consumption. Results show the FPGAs achieving competitive or better runtime performance for a range of multi-dimensional mesh problems compared to the V100 GPU. Additionally, the significant energy savings offered by FPGA implementations, over 30% for the most complex application, are quantified. We discuss the algorithmic trade-offs required to obtain good performance on FPGAs, giving insights into the feasibility and profitability of FPGA implementations.
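For orientation, the Thomas algorithm named in the citation statements above is the classic direct method for a single tridiagonal system. The following is a minimal illustrative sketch in plain Python, not code from any of the cited libraries. The function name `thomas_solve` and the argument layout (`a` sub-diagonal with `a[0]` unused, `b` main diagonal, `c` super-diagonal with `c[-1]` unused, `d` right-hand side) are assumptions made for this example, and the sketch assumes a diagonally dominant system because it does no pivoting.

```python
def thomas_solve(a, b, c, d):
    """Solve one tridiagonal system with the Thomas algorithm.

    Hypothetical layout for this sketch: all four lists have length n;
    a[0] and c[-1] are ignored. No pivoting is performed, so the matrix
    is assumed to be diagonally dominant (or otherwise well behaved).
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    # Forward sweep: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


# Small check: this right-hand side was built from the solution [1, 2, 3].
solution = thomas_solve(
    a=[0.0, -1.0, -1.0],
    b=[2.0, 2.0, 2.0],
    c=[-1.0, -1.0, 0.0],
    d=[0.0, 0.0, 4.0],
)
```

The forward sweep carries a dependency from each row to the next, which is why the works cited here also consider PCR and Spike variants that expose more parallelism on FPGAs and GPUs.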


“…OpenMP is also not the only way to access SIMD within C/C++. For example, OpenCL kernels may be compiled for CPUs that support SIMD units (Hurn et al, 2016; Macintosh et al, 2019). Instruction level intrinsic functions (available in R via the RcppXsimd package) allow advanced features such as efficient random variates for small vectors, but this approach is very challenging and akin to machine code.…”

Section: Discussion (mentioning)

confidence: 99%

Vector Operations for Accelerating Expensive Bayesian Computations – A Tutorial Guide

2021


Many applications in Bayesian statistics are extremely computationally intensive. However, they are often inherently parallel, making them prime targets for modern massively parallel processors. Multi-core and distributed computing is widely applied in the Bayesian community; however, very little attention has been given to fine-grain parallelisation using single instruction multiple data (SIMD) operations that are available on most modern CPUs. In this work, we practically demonstrate, using standard programming libraries, the utility of the SIMD approach for several topical Bayesian applications. Using the C programming language, we show that SIMD can improve the single-core floating point arithmetic performance by up to a factor of 6× compared with scalar C code and more than 25× compared with optimised R code. Such improvements are multiplicative to any gains achieved through multi-core processing. We illustrate the potential of SIMD for accelerating Bayesian computations and provide the reader with techniques for exploiting modern massively parallel processing environments.


“…e architecture was implemented in Xilinx Kintex-7 FPGA and compared to the software algorithm. FPGAs offer high flexibility to Application-Specific Integrated Circuit (ASIC) when implementing the algorithm with a high degree of parallelism [9,10]. Results show that 37-75 times performance enhancement could be achieved with the accelerator's clock frequency at 100 MHz.…”

Section: Introduction (mentioning)

confidence: 99%

FPGA Implementation of A∗ Algorithm for Real-Time Path Planning

Zhou, Jin, Wang 2020

International Journal of Reconfigurable Computing


The traditional A∗ algorithm is time-consuming due to a large number of iteration operations to calculate the evaluation function and sort the OPEN list. To achieve real-time path-planning performance, a hardware accelerator’s architecture called A∗ accelerator has been designed and implemented in field programmable gate array (FPGA). The specially designed 8-port cache and OPEN list array are introduced to tackle the calculation bottleneck. The system-on-a-chip (SOC) design is implemented in Xilinx Kintex-7 FPGA to evaluate A∗ accelerator. Experiments show that the hardware accelerator achieves 37–75 times performance enhancement relative to software implementation. It is suitable for real-time path-planning applications.

