FPGA accelerator for floating-point matrix multiplication

Jovanovic, Zeljko; Milutinović, Veljko

doi:10.1049/iet-cdt.2011.0132

Cited by 63 publications

(33 citation statements)

References 19 publications

(28 reference statements)


“…Theoretical analysis of an 800 × 800 matrix multiplication shows an execution time of 10^7 cycles. Jovanović and Milutinović [3] present an architecture of = 252 processing elements with local memories to store the input matrices. Large matrices are multiplied by sending blocks to the accelerator.…”

Section: Related Work

mentioning

confidence: 99%

“…The efforts to optimize the I/O result in a significant performance increase to 6,295 MFLOPS for = 124 and = 3 , i.e. 93% of the theoretical performance 0 (124,3) using equation (3).…”

Section: Overlapping Computation and Communication

mentioning

confidence: 99%

“…As a consequence both input matrices and are sent two times. It is possible to perform a loop interchange and an iteration reordering such that each block matrix is sent only once, a technique also used ad hoc in [3]. By generating the index tuple ( , , ) using the ( , 3)-ary generalized Gray code [5], the -loop becomes outermost and matrices or can be reused in the inner loops.…”

Section: Data Reuse Using Gray Code Block Ordering

mentioning

confidence: 99%
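The Gray code block ordering mentioned in the statement above can be illustrated with a minimal Python sketch. It is not the cited paper's implementation: the function names `reflected_gray` and `count_block_transfers`, the index names `k`, `i`, `j`, and the transfer model (the accelerator holds one block of each input matrix and a block is resent only when its index pair changes) are all assumptions made for illustration. The sketch compares a plain nested-loop order with an n-ary reflected Gray order in which the `k` index is the slowest-changing digit.

```python
from itertools import product


def reflected_gray(n, digits):
    """Yield every length-`digits` tuple over range(n) in n-ary reflected Gray order.

    Consecutive tuples differ in exactly one position, and by exactly 1 there.
    """
    if digits == 0:
        yield ()
        return
    for head in range(n):
        tails = list(reflected_gray(n, digits - 1))
        if head % 2 == 1:
            tails.reverse()  # reflect the sub-sequence on odd leading digits
        for tail in tails:
            yield (head,) + tail


def count_block_transfers(order):
    """Count input-block sends for a sequence of (k, i, j) block products.

    Assumed model (hypothetical): block product (k, i, j) needs A[i][k] and
    B[k][j], and a block is resent only if it differs from the one used last.
    """
    sent = 0
    prev_a = prev_b = None
    for k, i, j in order:
        a, b = (i, k), (k, j)
        sent += (a != prev_a) + (b != prev_b)
        prev_a, prev_b = a, b
    return sent


n = 3  # blocks per matrix dimension (illustrative value)
naive_order = list(product(range(n), repeat=3))  # plain nested loops over k, i, j
gray_order = list(reflected_gray(n, 3))          # k is the slowest-changing digit

print("naive transfers:", count_block_transfers(naive_order))
print("gray transfers: ", count_block_transfers(gray_order))
```

Under this simplified model every Gray-ordered step that changes `i` or `j` reuses one of the two input blocks, so only the rare `k` changes cost two transfers.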


High-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication

D’Hollander

2017

SIGARCH Comput. Archit. News


In the last decade floating-point matrix multiplication on FPGAs has been studied extensively and efficient architectures as well as detailed performance models have been developed. By design these IP cores take a fixed footprint which does not necessarily optimize the use of all available resources. Moreover, the low-level architectures are not easily amenable to a parameterized synthesis. In this paper high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization. An exploration strategy is presented to optimize the use of critical resources (DSPs, memory) for any given FPGA. To account for the limited memory size on the FPGA, a block-oriented matrix multiplication is organized such that the block summation is done on the CPU while the block multiplication occurs on the logic fabric simultaneously. The communication overhead between the CPU and the FPGA is minimized by streaming the blocks in a Gray code ordering scheme which maximizes the data reuse for consecutive block matrix product calculations. Using high-level synthesis optimization, the programmable logic operates at 93% of the theoretical peak performance and the combined CPU-FPGA design achieves 76% of the available hardware processing speed for the floating-point multiplication of 2K by 2K matrices.
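The split described in this abstract, block products on the logic fabric and block summation on the CPU, can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `matmul`, `block`, and `blocked_matmul` are hypothetical, the inputs are assumed square with a dimension divisible by the block size, and the two steps run sequentially here rather than simultaneously as in the described design.

```python
def matmul(a, b):
    """Plain triple-loop product; stands in for the accelerator's block multiply."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][c] for t in range(inner)) for c in range(cols)]
            for row in a]


def block(m, r0, c0, size):
    """Extract the size x size sub-matrix of m whose top-left corner is (r0, c0)."""
    return [row[c0:c0 + size] for row in m[r0:r0 + size]]


def blocked_matmul(a, b, size):
    """Multiply square matrices block by block.

    Assumes len(a) is divisible by `size`. The block product models the work
    sent to the accelerator; the accumulation models the host-side summation.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(0, n, size):
        for j in range(0, n, size):
            for k in range(0, n, size):
                # "Accelerator" step: one block-by-block product.
                partial = matmul(block(a, i, k, size), block(b, k, j, size))
                # "Host" step: add the partial block into the result.
                for r in range(size):
                    for s in range(size):
                        c[i + r][j + s] += partial[r][s]
    return c


a = [[float(4 * r + c) for c in range(4)] for r in range(4)]
b = [[float((r + 1) * (c + 2)) for c in range(4)] for r in range(4)]
print(blocked_matmul(a, b, 2) == matmul(a, b))
```

Keeping the summation outside the block product means the modelled accelerator never needs the running result block, only the two input blocks.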


“…This is due to the fact that having FPGAs with limited resources it is hardly possible to instantiate that many PEs. A recent work [17] describes an architecture of linear array PEs, similar to those in [16], but achieving an optimal latency of order O(n^2) by exploiting full duplex communication with the host processor and at the cost of having it involved during addition of intermediary values.…”

Section: Matrix Multiplication Tradeoffs On FPGAs

mentioning

confidence: 99%

FPGA design and implementation of a matrix multiplier based accelerator for 3D EKF SLAM

Tertei

Piat

Devy

2014

2014 International Conference on ReConFigurable Computing and FPGAs (ReConFig14)


In hw/sw co-design FPGAs are being used in order to accelerate existing solutions so they meet real-time constraints. As they consume less power than a standard microprocessor and provide powerful parallel data processing capabilities, they remain a highly optimizable tool and object of research within an embedded system. In this paper we present an efficient architecture for a matrix multiplication accelerator conceived as a systolic array co-processor to IBM's PPC440 processor on a Virtex5 XC5VFX70T FPGA. Our design is afterwards synthesized and wired as a large-scale matrix multiplier required for an embedded version of a visual Simultaneous Localization and Mapping (SLAM) algorithm based on the Extended Kalman Filter (EKF). This algorithm is implemented entirely as a System On a programmable Chip (SoC) design on the FPGA; an EKF epoch is executed at least 7.3 times faster than the pure software implementation, maintaining and correcting 20 points in the map. This optimization permits an EKF block throughput to be increased from 6.07 Hz to 44.39 Hz, which exceeds our real-time constraint of 30 Hz.


“…Recently, field programmable gate arrays (FPGAs) have become widely used as accelerators of software operations [1] [2]. However, since an FPGA is always used along with a single configuration context, its benefits are limited to its programmability.…”

mentioning

confidence: 99%

A dynamic optically reconfigurable gate array using a blue laser

Kobayashi

Watanabe

2013

2013 IEEE 4th International Conference on Photonics (ICP)


Recently, optically reconfigurable gate arrays (ORGAs), which can support a high-speed dynamic reconfiguration with numerous reconfiguration contexts, have been developed. Although an ORGA is a three-dimensional VLSI, neither through-silicon via (TSV) technology nor micro-bump technology is necessary to produce an ORGA. A three-dimensional ORGA uses only free-optical connections and a volume-type holographic memory technology. Therefore, the yield ratio of ORGAs is so high that ORGAs can easily be produced with no concern related to production variation. In this study, to increase the gate density, a short-wavelength laser of 404 nm is applied to the ORGA architecture. This paper presents the reconfiguration capabilities of the reconfiguration period and retention time of the photodiode memory architecture of a newly fabricated ORGA VLSI.
