2017
DOI: 10.1145/3039902.3039916

High-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication

Abstract: In the last decade, floating-point matrix multiplication on FPGAs has been studied extensively, and efficient architectures as well as detailed performance models have been developed. By design, these IP cores occupy a fixed footprint, which does not necessarily make optimal use of all available resources. Moreover, the low-level architectures are not easily amenable to parameterized synthesis. In this paper, high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performa…
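The paper's subject, blocked (tiled) floating-point matrix multiplication with an HLS-tunable configuration, can be illustrated with a minimal C sketch. All names, the matrix size N, and the block factor B here are illustrative assumptions, not values taken from the paper; the point is that the tile size is an ordinary compile-time parameter that HLS tooling can sweep.

```c
#include <assert.h>

#define N 8   /* matrix dimension (assumed small for illustration) */
#define B 4   /* block (tile) size: the kind of parameter HLS can tune */

/* Blocked matrix multiplication: C += A * Bm, tiled so that each BxB
   working set would fit in on-chip memory (BRAM) on an FPGA. */
static void matmul_blocked(const float A[N][N], const float Bm[N][N],
                           float C[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* inner tile: the candidate loop nest for HLS
                   pipelining and unrolling */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        float acc = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            acc += A[i][k] * Bm[k][j];
                        C[i][j] = acc;
                    }
}
```

Because B is a plain constant rather than a property of a hand-written RTL datapath, a parameterized synthesis flow can re-generate the design for each candidate tile size, which is the flexibility the abstract contrasts with fixed-footprint IP cores.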

Cited by 7 publications (5 citation statements)
References 6 publications
Citing publications: 2017–2023
“…Much work has been done in optimizing C/C++/OpenCL HLS codes for FPGA, such as stencils [36], [37], [38], [67], [68], deep neural networks [69], [70], [50], matrix multiplication [71], [68], graph processing [72], [73], and protein sequencing [74], [75]. These works optimize the respective applications using transformations described here, such as delay buffering, vectorization, replication, and streaming.…”
Section: Related Work
confidence: 99%
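Two of the transformations named in the excerpt above, vectorization and replication, can be sketched in C. This is a hypothetical illustration (the function name, lengths, and unroll factor U are my assumptions): the replicated accumulators mirror what an HLS unroll plus array-partition transformation produces in hardware, where U multiply-add units operate each cycle.

```c
#include <assert.h>

#define LEN 16   /* vector length, assumed a multiple of the unroll factor */
#define U   4    /* unroll (replication) factor */

/* Dot product with U independent accumulators: in an HLS flow the
   inner loop would be fully unrolled, so the U multiply-adds execute
   in parallel rather than sequentially. */
static float dot_unrolled(const float x[LEN], const float y[LEN]) {
    float acc[U] = {0};
    for (int i = 0; i < LEN; i += U)
        for (int u = 0; u < U; u++)      /* fully unrolled in hardware */
            acc[u] += x[i + u] * y[i + u];
    float sum = 0;
    for (int u = 0; u < U; u++)          /* reduction tree over replicas */
        sum += acc[u];
    return sum;
}
```

Splitting the single loop-carried accumulator into U independent ones is what breaks the dependence chain; without it, the floating-point add latency would prevent the pipeline from accepting one input per cycle.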
“…Much of the previous work focuses on low-level implementation for performance [20], explores high-level optimizations [28], or implements MMM in the context of neural networks [19,29]. To the best of our knowledge, this is the first work to minimize I/O of matrix multiplication on FPGA in terms of hardware constants, and the first work to open-source our implementation for the benefit of the community.…”
Section: Related Work
confidence: 99%
“…The authors derive the required off-chip bandwidth and buffer space required to achieve peak performance on the target device, but do not model or optimize I/O in terms of their buffer space usage, and do not report their tile sizes or how they were chosen. Furthermore, the authors double-buffer the output tile, reducing the maximum achievable computational intensity by a factor…”
[Table residue: comparison of prior FPGA matrix-multiplication implementations — [30] (2004, Virtex-II Pro), Dou [31] (2005, Virtex-II Pro), Kumar [32] (2009, Virtex-5), Jovanović [20] (2012, Virtex-6), D'Hollander [28] (2016); the frequency and resource columns are not recoverable from the extraction.]
Section: Related Work
confidence: 99%
“…One such application is the Fast Fourier Transform (FFT) and other algorithms based on it [22]–[24]. Other areas include neural networks [25], matrix multiplication [26], digital filters [27], [28], communication systems [29] and more.…”
Section: Introduction
confidence: 99%