2010
DOI: 10.1007/s10766-010-0131-8
FPGA Based High Performance Double-Precision Matrix Multiplication

Abstract: We present two designs (I and II) for IEEE 754 double-precision floating-point matrix multiplication, optimized for implementation on high-end FPGAs. Matrix multiplication forms the kernel of many important tile-based BLAS algorithms, making it an excellent candidate for acceleration. Both designs are based on the rank-1 update scheme, handle arbitrary matrix sizes, and sustain their peak performance except during an initial latency period. Through these designs, the trade-offs involved in terms of local memory and…
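The rank-1 update scheme the abstract refers to computes the product C = A·B as a sum of outer products: at step k, column k of A is multiplied by row k of B and accumulated into C. A minimal software sketch of this idea (illustrative only — the paper describes a hardware design, not this code):

```python
import numpy as np

def matmul_rank1(A, B):
    """Multiply A (n x p) by B (p x m) via rank-1 updates.

    Each iteration adds one outer product A[:, k] * B[k, :] to C,
    mirroring how a rank-1-update engine streams one column of A
    and one row of B per step and accumulates partial results.
    """
    n, p = A.shape
    p2, m = B.shape
    assert p == p2, "inner dimensions must match"
    C = np.zeros((n, m))
    for k in range(p):
        C += np.outer(A[:, k], B[k, :])  # rank-1 update
    return C
```

This formulation is attractive for streaming hardware because each update needs only one column of A and one row of B at a time, so sub-blocks can be fed in while earlier partial sums are still being accumulated.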

Cited by 30 publications (23 citation statements). References 11 publications.
“…They obtain a performance of 2.06 GFLOPS for a 1K by 1K matrix multiply on a Cray XD1 accelerator. Kumar et al. [4] use a rank-1 update scheme to implement parallel processing elements. Sub-blocks of the matrices are streamed to the architecture and intermediate results are accumulated, allowing communication and computation to overlap.…”
Section: Related Work
confidence: 99%
“…[9] uses an algorithm for scheduling input data to processing elements which has the same loop execution order as that of Zhuo and Prasanna [8]. However, instead of a systolic array-like structure (in which every PE communicates only with the adjacent ones), it uses broadcast to distribute the same elements of the first matrix simultaneously to all PEs.…”
Section: Related Work
confidence: 99%
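The broadcast scheduling described in that excerpt can be modeled in software: assign one processing element (PE) per output column, broadcast a single element of the first matrix to all PEs each cycle, and let each PE multiply it against its locally held element of the second matrix. This is a toy model under assumed data layout, not the cited architecture itself:

```python
import numpy as np

def broadcast_matmul(A, B):
    """Toy model of broadcast-based scheduling for C = A @ B.

    PE j owns column j of B (and of C). In each step, one element
    A[i, k] is broadcast to all PEs; PE j computes A[i, k] * B[k, j]
    and accumulates into C[i, j]. The inner loop over j models work
    that the hardware PEs would perform in parallel.
    """
    n, p = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for k in range(p):
            a = A[i, k]              # value broadcast to every PE
            for j in range(m):       # each PE, in parallel in hardware
                C[i, j] += a * B[k, j]
    return C
```

Compared with a systolic array, broadcast removes the PE-to-PE forwarding of A's elements at the cost of a wider fan-out on the broadcast path.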
“…The work of Kumar et al. [9] is more recent and uses FPGAs with 25×18 multipliers. In spite of that, their floating-point multiplier design requires 13 such blocks.…”
Section: Related Work
confidence: 99%