Comparison of High Level FPGA Hardware Design for Solving Tri-diagonal Linear Systems

Warne, David J.; Kelson, Neil A.; Hayward, Ross

doi:10.1016/j.procs.2014.05.009

Cited by 16 publications

(10 citation statements)

References 8 publications


“…As such, our underlying goal is to understand the criteria for a given system solver to be amenable to FPGA implementation and uncover the limitations and profitability of such accelerators. Previous work on tridiagonal system solvers for FPGAs utilized both low-level hardware description languages [18], [28], [31] as well as high-level synthesis tools [3], [14], [15], [16], [29]. They demonstrated implementation of standard tridiagonal system solver algorithms (Thomas, PCR, and Spike), evaluating how to best utilize FPGA resources to maximize performance.…”

Section: Introduction (mentioning)

confidence: 99%

High Throughput Multidimensional Tridiagonal Systems Solvers on FPGAs

Kamalakkannan, Reguly, Fahmy et al. 2022

Preprint


This paper presents a design space exploration for synthesizing optimized, high-throughput implementations of multiple multi-dimensional tridiagonal system solvers on FPGAs. Re-evaluating the characteristics of algorithms for the direct solution of tridiagonal systems, we develop a new tridiagonal solver library aimed at implementing high-performance computing applications on Xilinx FPGA hardware. Key new features of the library are (1) the unification of standard state-of-the-art techniques for implementing implicit numerical solvers with a number of novel high-gain optimizations such as vectorization and batching, motivated by multiple multi-dimensional systems common in real-world applications, (2) data-flow techniques that provide application specific optimizations for both 2D and 3D problems, including integration of explicit loops commonplace in real workloads, and (3) the development of a predictive analytic model to explore the design space, and obtain rapid resource and performance estimates. The new library provides an order of magnitude better performance when solving large batches of systems compared to Xilinx's current tridiagonal solver library. Two representative applications are implemented using the new solver on a Xilinx Alveo U280 FPGA, demonstrating over 85% predictive model accuracy. These are compared with a current state-of-the-art GPU library for solving multi-dimensional tridiagonal systems on an Nvidia V100 GPU, analyzing time to solution, bandwidth, and energy consumption. Results show the FPGAs achieving competitive or better runtime performance for a range of multi-dimensional mesh problems compared to the V100 GPU. Additionally, the significant energy savings offered by FPGA implementations, over 30% for the most complex application, are quantified. We discuss the algorithmic trade-offs required to obtain good performance on FPGAs, giving insights into the feasibility and profitability of FPGA implementations.
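The batching idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the library's design: the function name `solve_batch` and the row-major data layout are assumptions made here for illustration. Each row of the Thomas recurrence still depends on the previous row, but the inner loop over systems is fully independent, which is the part a vectorized or pipelined datapath could overlap.

```python
def solve_batch(a, b, c, d):
    """Solve many independent tridiagonal systems of equal size n.

    a, b, c, d are lists of n rows; row i holds one value per system
    (sub-diagonal, diagonal, super-diagonal, right-hand side).
    a[0][k] and c[n-1][k] do not affect the result.
    Hypothetical layout, chosen so the inner loop runs over the batch.
    """
    n = len(b)
    batch = len(b[0])
    cp = [[0.0] * batch for _ in range(n)]  # modified super-diagonal
    dp = [[0.0] * batch for _ in range(n)]  # modified right-hand side

    for k in range(batch):
        cp[0][k] = c[0][k] / b[0][k]
        dp[0][k] = d[0][k] / b[0][k]

    # Forward elimination: serial over rows, independent over systems.
    for i in range(1, n):
        for k in range(batch):
            denom = b[i][k] - a[i][k] * cp[i - 1][k]
            cp[i][k] = c[i][k] / denom
            dp[i][k] = (d[i][k] - a[i][k] * dp[i - 1][k]) / denom

    # Back substitution: again serial over rows only.
    x = [[0.0] * batch for _ in range(n)]
    x[n - 1] = dp[n - 1][:]
    for i in range(n - 2, -1, -1):
        for k in range(batch):
            x[i][k] = dp[i][k] - cp[i][k] * x[i + 1][k]
    return x
```

The sketch assumes every system is well conditioned enough to need no pivoting, as is the case for diagonally dominant systems.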


“…Using the Altera based Opencl, Warne et al [18,19] demonstrated the ease in which a custom tridiagonal linear system solver can be deployed. In this work, we extend our previous efforts towards a more general highly parallel solution, targeting fpgas in particular, but also other Opencl compliant co-processors that may be present within a heterogeneous computing environment.…”

Section: C448 (mentioning)

confidence: 99%

“…We converted the estimated throughput to system solves per second (rather than more usual flop based measures) to ease comparison with the most relevant studies. In reality, data transfer overheads can impede this throughput [19]; however, there are a number of coding practices which can assist in minimising this impact [2].…”

Section: Compute Performance (mentioning)

confidence: 99%

Implementation of parallel tridiagonal solvers for a heterogeneous computing environment

Macintosh, Warne, Kelson et al. 2016

ANZIAMJ


Tridiagonal diagonally dominant linear systems arise in many scientific and engineering applications. The standard Thomas algorithm for solving such systems is inherently serial, forming a bottleneck in computation. Algorithms such as cyclic reduction and spike reduce a single large tridiagonal system into multiple small independent systems which are solved in parallel. We develop portable cyclic reduction and the spike algorithm for Open Computing Language implementations on a range of co-processors in a heterogeneous computing environment, including field programmable gate arrays, graphics processing units and other multi-core processors. We evaluate these designs in the context of solver performance, resource efficiency and numerical accuracy.
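To make the contrast with the serial Thomas algorithm concrete, here is a minimal sketch of parallel cyclic reduction (PCR), one variant of the cyclic reduction family the abstract refers to. It is plain Python rather than the authors' OpenCL code, and the name `pcr_solve` is an assumption made here. At each step every equation is updated from neighbours at distance `s`, and the updates for different `i` do not depend on each other, so on parallel hardware they could run concurrently.

```python
def pcr_solve(a, b, c, d):
    """Solve a tridiagonal system with parallel cyclic reduction.

    Row i reads: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].
    a[0] and c[n-1] are treated as zero. No pivoting is done, so the
    system is assumed to be diagonally dominant.
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[n - 1] = 0.0

    s = 1  # current coupling distance between unknowns
    while s < n:
        na, nc = [0.0] * n, [0.0] * n
        nb, nd = list(b), list(d)
        # Each i reads only the old arrays, so iterations are independent.
        for i in range(n):
            if i - s >= 0:
                alpha = -a[i] / b[i - s]
                na[i] = alpha * a[i - s]
                nb[i] += alpha * c[i - s]
                nd[i] += alpha * d[i - s]
            if i + s < n:
                gamma = -c[i] / b[i + s]
                nc[i] = gamma * c[i + s]
                nb[i] += gamma * a[i + s]
                nd[i] += gamma * d[i + s]
        a, b, c, d = na, nb, nc, nd
        s *= 2

    # All off-diagonal couplings are now zero: n independent 1x1 systems.
    return [d[i] / b[i] for i in range(n)]
```

The loop runs about log2(n) times, compared with the n dependent steps of the Thomas recurrence, at the cost of more arithmetic per step.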


“…FPGA accelerator cards require an order of magnitude less power compared to HPC grade CPUs and GPUs. Previous efforts in developing FPGA-based routines to solve tridiagonal systems have been limited to solving small systems with the serial Thomas algorithm [11][12][13]. We have previously investigated the feasibility of FPGA implementations of parallel algorithms including the parallel cyclic reduction and SPIKE [14] for solving small tridiagonal linear systems.…”

Section: Introduction (mentioning)

confidence: 99%

Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs

Macintosh, Banks, Kelson 2019

International Journal of Reconfigurable Computing

Self Cite


Solving diagonally dominant tridiagonal linear systems is a common problem in scientific high-performance computing (HPC). Furthermore, it is becoming more commonplace for HPC platforms to utilise a heterogeneous combination of computing devices. Whilst it is desirable to design faster implementations of parallel linear system solvers, power consumption concerns are increasing in priority. This work presents the oclspkt routine. The oclspkt routine is a heterogeneous OpenCL implementation of the truncated SPIKE algorithm that can use FPGAs, GPUs, and CPUs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. The routine is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator in a heterogeneous environment depending on the accelerator’s compute performance. The truncated SPIKE FPGA solver is developed first for optimising OpenCL device kernel performance, global memory bandwidth, and interleaved host to device memory transactions. The FPGA OpenCL kernel code is then refactored and optimised to best exploit the underlying architecture of the CPU and GPU. An optimised TDMA OpenCL kernel is also developed to act as a serial baseline performance comparison for the parallel truncated SPIKE kernel since no FPGA tridiagonal solver capable of solving large tridiagonal systems was available at the time of development. The individual GPU, CPU, and FPGA solvers of the oclspkt routine are 110%, 150%, and 170% faster, respectively, than comparable device-optimised third-party solvers and applicable baselines. Assessing heterogeneous combinations of compute devices, the GPU + FPGA combination is found to have the best compute performance and the FPGA-only configuration is found to have the best overall estimated energy efficiency.
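The abstract describes allocating workloads to each accelerator according to its compute performance. One simple way to do this is a proportional split, sketched below. This is an illustration only and not the scheme used by oclspkt: the function name `split_workload`, the device names, and the throughput figures in the usage line are all hypothetical.

```python
def split_workload(total, throughput):
    """Split `total` work items across devices in proportion to throughput.

    throughput maps a device name to a measured rate (items per second).
    Returns a dict of device name -> item count that sums to `total`.
    """
    rate_sum = sum(throughput.values())
    shares = {
        name: int(total * rate / rate_sum)  # round each share down
        for name, rate in throughput.items()
    }
    # Hand out the items lost to rounding, fastest device first.
    leftover = total - sum(shares.values())
    fastest_first = sorted(throughput, key=throughput.get, reverse=True)
    for name in fastest_first[:leftover]:
        shares[name] += 1
    return shares


# Hypothetical rates: the GPU is assumed three times faster than the FPGA.
print(split_workload(1000, {"fpga": 1.0, "gpu": 3.0}))
```

With these made-up rates the FPGA receives a quarter of the items and the GPU the rest, so both devices would finish at roughly the same time.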

