2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines 2012
DOI: 10.1109/fccm.2012.12

Towards a Universal FPGA Matrix-Vector Multiplication Architecture

Abstract: We present the design and implementation of a universal, single-bitstream library for accelerating matrix-vector multiplication using FPGAs. Our library handles multiple matrix encodings ranging from dense to multiple sparse formats. A key novelty in our approach is the introduction of a hardware-optimized sparse matrix representation called Compressed Variable-Length Bit Vector (CVBV), which reduces the storage and bandwidth requirements by up to 43% (25% on average) compared to compressed sparse row (CSR) across…
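The abstract's CVBV format is only partially described here, but the core idea it builds on — replacing CSR's per-nonzero column indices with a per-entry bit vector — can be illustrated with a minimal sketch. This is a simplified plain-bitmap comparison under stated assumptions (32-bit column indices, row-pointer cost ignored, no variable-length compression), not the paper's actual CVBV encoding; the helper names are hypothetical.

```python
import numpy as np

def csr_index_bits(matrix, col_bits=32):
    """Index-storage cost (in bits) of CSR: one column index per nonzero.
    Row-pointer storage is ignored for simplicity (assumption)."""
    nnz = int(np.count_nonzero(matrix))
    return nnz * col_bits

def bitvector_index_bits(matrix):
    """Index-storage cost of a plain bit-vector encoding: one bit per
    matrix entry marks whether that entry is nonzero."""
    return matrix.size

# Toy 4x8 sparse matrix with 4 nonzeros
A = np.zeros((4, 8))
A[0, 1] = 3.0
A[1, 4] = 1.5
A[2, 2] = 2.0
A[3, 7] = 4.2

print(csr_index_bits(A))        # 4 nonzeros * 32 bits = 128
print(bitvector_index_bits(A))  # 32 entries * 1 bit = 32
```

The plain bitmap wins here because the matrix is small and moderately sparse; for extremely sparse matrices the bitmap's one-bit-per-entry cost grows with matrix size, which is presumably why CVBV adds variable-length compression of zero runs on top of the raw bit vector.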

Cited by 61 publications (35 citation statements)
References 20 publications
“…7, the proposed accelerator can obtain higher performance for most of the test matrices, compared with the implementations on the Convey HC2ex platform with four Virtex-6 LX760 FPGAs [13], HC-1 [12] and Tesla S1070 [7]. With the number of the nonzero block in one block row and the density of one increasing, the performance improvement can be higher.…”
Section: Performance Comparison (mentioning)
confidence: 96%
“…However, the overhead of the word-level-encoded index data of each nonzero element limits the performance improvement. As the works in [7,8,9], the overhead can be reduced by replacing the indices with bitmap, and the indices are retrieved through the decoding before the computing. However, the performance of these works is restricted by the idle cycles in the index decoding and the zero fillings in the bitmap.…”
Section: Related Work (mentioning)
confidence: 99%
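The citation statement above notes that bitmap schemes retrieve nonzero indices "through the decoding before the computing." As a rough sketch of what that decoding step involves, the following hypothetical helper recovers column indices from one row's bitmap (this is an illustration of the general bitmap-decoding idea, not the decoder from any of the cited works):

```python
def decode_bitmap_row(bitmap):
    """Recover the nonzero column indices of one matrix row from its
    bitmap, where bitmap[col] == 1 marks a nonzero at that column."""
    return [col for col, bit in enumerate(bitmap) if bit]

# Row with nonzeros at columns 1, 4 and 7
row_bitmap = [0, 1, 0, 0, 1, 0, 0, 1]
print(decode_bitmap_row(row_bitmap))  # [1, 4, 7]
```

In hardware this scan is typically done with priority encoders or population counts rather than a sequential loop; the quoted statement's point is that stalls in this decode stage, plus explicit zeros carried in the bitmap, can limit the achievable speedup.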
“…More recently, the focus has shifted to efficient use of on-chip memory resources and DRAM bandwidth utilisation [5], [7], [9]. Recently, compression techniques have been proposed to improve the performance on memory-bound matrices [8], [17]. The constant sparsity structure in the context of iterative methods has also been exploited to optimise FPGA architectures for SpMV [18]. Static one-off pre-processing techniques are cost-effective for FPGA implementations if they can lead either to a simplified architecture [5], [7], [19] or reduced communication overhead [8], [17].…”
Section: (unlabelled) mentioning
confidence: 99%
“…Recently, compression techniques have been proposed to improve the performance on memory-bound matrices [8], [17]. The constant sparsity structure in the context of iterative methods has also been exploited to optimise FPGA architectures for SpMV [18]. Static one-off pre-processing techniques are cost-effective for FPGA implementations if they can lead either to a simplified architecture [5], [7], [19] or reduced communication overhead [8], [17]. Linear or log-linear preprocessing techniques with good performance in practice, such as the method used in this work for extracting matrix properties, have been found to be effective.…”
Section: (unlabelled) mentioning
confidence: 99%