2012
DOI: 10.1109/tcsi.2011.2161389
An I/O Bandwidth-Sensitive Sparse Matrix-Vector Multiplication Engine on FPGAs

Cited by 19 publications (15 citation statements)
References 21 publications
“…A similar design is proposed in [5], which employs a multiple-input multiple-output multiply-accumulate unit and a reduction unit to process multiple rows per clock cycle; however, the serial reduction limits the performance. Song Sun et al. make use of an input pattern vector (IPV) and a map table to implement SpMV without pipeline stalls or excessive zero-padding [11]; however, the storage of the IPV and map table limits the dimension of the sparse matrix. K. Nagar et al. [12] implemented SpMV for large-scale sparse matrices on the Convey HC-1 with a novel streaming multiply-accumulator and a local vector cache.…”
Section: Related Work
confidence: 99%
“…Depending on the implementation, the meta-data for CSR is either pre-loaded into the bitstream or accessed dynamically from external memory. While earlier designs were restricted to on-die memory capacities (e.g., [18]), more recent designs incorporate memory hierarchies that can handle large data sets exceeding the available on-chip memories [24,25,26,11,10,27,9,28,29,30,14,23].…”
Section: Related Work
confidence: 99%
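The designs quoted above all consume matrices in CSR (Compressed Sparse Row) form, whose three arrays are the "meta-data" that FPGA engines either bake into the bitstream or stream from external memory. As a point of reference, a minimal software sketch of CSR SpMV (not any specific cited design) looks like this:

```python
# Minimal CSR SpMV sketch: y = A @ x.
# values  holds the nonzeros row by row,
# col_idx the column index of each nonzero,
# row_ptr the offset in values where each row starts.

def spmv_csr(values, col_idx, row_ptr, x):
    """Multiply a CSR-encoded sparse matrix by a dense vector x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for r in range(n_rows):
        # Row r's nonzeros occupy values[row_ptr[r]:row_ptr[r+1]];
        # this inner dot product is what a hardware
        # multiply-accumulate pipeline streams through.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += values[k] * x[col_idx[k]]
    return y

# Example: A = [[1, 0, 2],
#               [0, 3, 0]]
values  = [1.0, 2.0, 3.0]
col_idx = [0, 2, 1]
row_ptr = [0, 2, 3]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The irregular, x-dependent memory accesses in the inner loop (`x[col_idx[k]]`) are precisely what makes SpMV I/O-bandwidth-sensitive and motivates the vector caches and memory hierarchies discussed in the citing works.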
“…In this way, energy-reduction techniques must be applied at all design levels of the system. Moreover, since the most effective design decisions are made at the architecture and system levels, careful design at these levels can reduce power consumption considerably [8].…”
Section: Introduction
confidence: 99%