2020
DOI: 10.1109/tcad.2019.2912923

A Streaming Dataflow Engine for Sparse Matrix-Vector Multiplication Using High-Level Synthesis

Abstract: Using high-level synthesis techniques, this paper proposes an adaptable, high-performance streaming dataflow engine for sparse matrix-dense vector multiplication (SpMV) suitable for embedded FPGAs. As SpMV is a memory-bound algorithm, the engine combines the three concepts of loop pipelining, dataflow graphs, and data streaming to utilize most of the memory bandwidth available to the FPGA. The main goal of this paper is to show that FPGAs can provide comparable performance for memory-bound applications to th…
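The abstract describes combining loop pipelining, a dataflow graph, and data streaming for SpMV. The sketch below is an editor-added illustration of that combination in Xilinx HLS-style C++, not the authors' code: a DATAFLOW region connects a load stage and a compute stage through hls::stream channels, and the inner loops request II=1 pipelining (floating-point accumulation may raise the achieved II). Function names, interfaces, and capacities are assumptions.

#include <hls_stream.h>

// Stage 1: stream the CSR value and column-index arrays from external memory.
static void load_nnz(const float *val, const int *col, int nnz,
                     hls::stream<float> &val_s, hls::stream<int> &col_s) {
    for (int i = 0; i < nnz; ++i) {
#pragma HLS PIPELINE II=1
        val_s.write(val[i]);
        col_s.write(col[i]);
    }
}

// Stage 2: consume the streams and accumulate one dot product per row.
static void compute(const int *row_ptr, const float *x, int rows,
                    hls::stream<float> &val_s, hls::stream<int> &col_s,
                    float *y) {
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        int len = row_ptr[r + 1] - row_ptr[r];
        for (int k = 0; k < len; ++k) {
#pragma HLS PIPELINE II=1
            acc += val_s.read() * x[col_s.read()];
        }
        y[r] = acc;
    }
}

// Top level: DATAFLOW lets the two stages run concurrently, so the memory
// interface is kept busy while rows are being reduced.
void spmv_stream(const float *val, const int *col, const int *row_ptr,
                 const float *x, float *y, int rows, int nnz) {
#pragma HLS DATAFLOW
    hls::stream<float> val_s("val_s");
    hls::stream<int>   col_s("col_s");
    load_nnz(val, col, nnz, val_s, col_s);
    compute(row_ptr, x, rows, val_s, col_s, y);
}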

Cited by 24 publications (6 citation statements)
References 22 publications (40 reference statements)
“…This work proposes an optimized algorithm structure for MGS-QRD through loop optimization techniques and algorithm restructuring, together with an improved TMI that eliminates redundant matrix operations and an appropriate choice of matrix multiplication, integrated into a fast and efficient pseudoinverse computation hardware accelerator on FPGA. The FPGA's synthesizable logic fabric allows high parallelism and enables high computational speed, as demonstrated in previous works; [6][7][8][9] moreover, it also provides flexibility and programmability for real-time embedded solutions. These advantages make the FPGA a strong candidate as a hardware accelerator.…”
Section: Introduction (mentioning)
confidence: 93%
“…The next step consisted of comparing the results with an open library for SpMV on FPGAs [14]. The library implements a streaming version of SpMV in the traditional CSR format.…”
Section: Comparison With An Open Library (mentioning)
confidence: 99%
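For context on the "traditional CSR format" mentioned in the excerpt above, here is a short editor-added reference sketch in plain C++; the names are conventional and not taken from the cited library. The matrix is stored as three arrays, and SpMV walks each row's slice of the value and column arrays.

#include <vector>

struct CsrMatrix {
    int rows;
    std::vector<int>   row_ptr;  // size rows+1: start of each row in val/col
    std::vector<int>   col;      // column index of each non-zero
    std::vector<float> val;      // non-zero values
};

// Reference (software) SpMV: y = A * x
void spmv_csr(const CsrMatrix &A, const std::vector<float> &x,
              std::vector<float> &y) {
    y.assign(A.rows, 0.0f);
    for (int r = 0; r < A.rows; ++r)
        for (int k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            y[r] += A.val[k] * x[A.col[k]];
}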
“…A cache vector strategy was presented in [12] and later utilized in [13]; the caching scheme maximizes data reuse of the multiplying vector by running a preprocessing step that determines the cache misses, which then serve as an input to the algorithm. Cache misses have also been addressed by fully transferring the multiplying vector into BRAM [14]. These works aim to be efficient on a broad set of matrices and do not exploit the sparsity-pattern information available in our CFD matrices.…”
Section: Introduction (mentioning)
confidence: 99%
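The excerpt above mentions handling cache misses by fully transferring the multiplying vector into BRAM [14]. The following HLS-style C++ sketch is an editor-added illustration of that idea under stated assumptions (the MAX_COLS bound, the partitioning factor, and all names are not from the cited work): the dense vector is copied once into an on-chip buffer, so the irregular column-indexed reads are served from BRAM instead of external memory.

// Illustrative sketch only: buffer the dense vector x on chip, then run CSR SpMV.
const int MAX_COLS = 4096;  // assumed upper bound on the number of columns

void spmv_vector_in_bram(const float *val, const int *col, const int *row_ptr,
                         const float *x, float *y, int rows, int cols) {
    static float x_local[MAX_COLS];
#pragma HLS ARRAY_PARTITION variable=x_local cyclic factor=4

    // One sequential pass copies x into on-chip memory.
    for (int j = 0; j < cols; ++j) {
#pragma HLS PIPELINE II=1
        x_local[j] = x[j];
    }

    // All random accesses x_local[col[k]] now hit BRAM rather than DDR.
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k) {
#pragma HLS PIPELINE II=1
            acc += val[k] * x_local[col[k]];
        }
        y[r] = acc;
    }
}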
“…In [18] the authors propose a streaming dataflow architecture to perform the SpMV operation on an embedded platform containing a Xilinx ZynqMP FPGA. The proposed solution consists of a deep pipeline that constantly consumes input data with no stalls.…”
Section: Related Work (mentioning)
confidence: 99%
“…For Xilinx's data center platform we tested the GEMM implementation from [14] and developed our own version of SpMV based on the work in [18].…”
Section: High-end FPGA (mentioning)
confidence: 99%