2010
DOI: 10.1007/s10766-010-0131-8
FPGA Based High Performance Double-Precision Matrix Multiplication

Abstract: We present two designs (I and II) for IEEE 754 double-precision floating-point matrix multiplication, optimized for implementation on high-end FPGAs. Matrix multiplication forms the kernel of many important tile-based BLAS algorithms, making it an excellent candidate for acceleration. Both designs are based on the rank-1 update scheme, handle arbitrary matrix sizes, and sustain their peak performance except during an initial latency period. Through these designs, the trade-offs involved in terms of local memory and…
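The rank-1 update scheme the abstract refers to computes the product C = A·B as a sum of outer products: at step k, column k of A is multiplied by row k of B and accumulated into C. A minimal software sketch of this idea (illustrative only — the paper describes a hardware design, not this code):

```python
import numpy as np

def matmul_rank1(A, B):
    """Multiply A (n x p) by B (p x m) via rank-1 updates.

    Each iteration adds one outer product A[:, k] * B[k, :] to C,
    mirroring how a rank-1-update engine streams one column of A
    and one row of B per step and accumulates partial results.
    """
    n, p = A.shape
    p2, m = B.shape
    assert p == p2, "inner dimensions must match"
    C = np.zeros((n, m))
    for k in range(p):
        C += np.outer(A[:, k], B[k, :])  # rank-1 update
    return C
```

This formulation is attractive for streaming hardware because each update needs only one column of A and one row of B at a time, so sub-blocks can be fed in while earlier partial sums are still being accumulated.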

Cited by 30 publications (23 citation statements). References 11 publications.
“…They obtain a performance of 2.06 GFLOPS for a 1K by 1K matrix multiply on a Cray XD1 accelerator. Kumar et al. [4] use a rank-1 update scheme to implement parallel processing elements. Sub-blocks of the matrices are streamed to the architecture and intermediate results are accumulated, allowing communication and computation to overlap.…”
Section: Related Work
confidence: 99%
“…[9] uses an algorithm for scheduling input data to processing elements which has the same loop execution order as that of Zhuo and Prasanna [8]. However, instead of a systolic array-like structure (in which every PE communicates only with the adjacent ones), it uses broadcast to distribute the same elements of the first matrix simultaneously to all PEs.…”
Section: Related Work
confidence: 99%
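The broadcast scheduling described in that excerpt can be modeled in software: assign one processing element (PE) per output column, broadcast a single element of the first matrix to all PEs each cycle, and let each PE multiply it against its locally held element of the second matrix. This is a toy model under assumed data layout, not the cited architecture itself:

```python
import numpy as np

def broadcast_matmul(A, B):
    """Toy model of broadcast-based scheduling for C = A @ B.

    PE j owns column j of B (and of C). In each step, one element
    A[i, k] is broadcast to all PEs; PE j computes A[i, k] * B[k, j]
    and accumulates into C[i, j]. The inner loop over j models work
    that the hardware PEs would perform in parallel.
    """
    n, p = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for k in range(p):
            a = A[i, k]              # value broadcast to every PE
            for j in range(m):       # each PE, in parallel in hardware
                C[i, j] += a * B[k, j]
    return C
```

Compared with a systolic array, broadcast removes the PE-to-PE forwarding of A's elements at the cost of a wider fan-out on the broadcast path.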
“…The work of Kumar et al. [9] is more recent and uses FPGAs with 25×18 multipliers. In spite of that, their floating-point multiplier design requires 13 such blocks.…”
Section: Related Work
confidence: 99%