2017
DOI: 10.1145/3039902.3039916

High-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication

Abstract: In the last decade, floating-point matrix multiplication on FPGAs has been studied extensively, and efficient architectures as well as detailed performance models have been developed. By design, these IP cores occupy a fixed footprint, which does not necessarily make optimal use of all available resources. Moreover, the low-level architectures are not easily amenable to parameterized synthesis. In this paper, high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performa…
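The paper's subject, blocked (tiled) floating-point matrix multiplication with an HLS-tunable configuration, can be illustrated with a minimal C sketch. All names, the matrix size N, and the block factor B here are illustrative assumptions, not values taken from the paper; the point is that the tile size is an ordinary compile-time parameter that HLS tooling can sweep.

```c
#include <assert.h>

#define N 8   /* matrix dimension (assumed small for illustration) */
#define B 4   /* block (tile) size: the kind of parameter HLS can tune */

/* Blocked matrix multiplication: C += A * Bm, tiled so that each BxB
   working set would fit in on-chip memory (BRAM) on an FPGA. */
static void matmul_blocked(const float A[N][N], const float Bm[N][N],
                           float C[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* inner tile: the candidate loop nest for HLS
                   pipelining and unrolling */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        float acc = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            acc += A[i][k] * Bm[k][j];
                        C[i][j] = acc;
                    }
}
```

Because B is a plain constant rather than a property of a hand-written RTL datapath, a parameterized synthesis flow can re-generate the design for each candidate tile size, which is the flexibility the abstract contrasts with fixed-footprint IP cores.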

Cited by 7 publications (5 citation statements)
References 6 publications
Citing publications: 2017–2023
“…Much work has been done in optimizing C/C++/OpenCL HLS codes for FPGA, such as stencils [36], [37], [38], [67], [68], deep neural networks [69], [70], [50], matrix multiplication [71], [68], graph processing [72], [73], and protein sequencing [74], [75]. These works optimize the respective applications using transformations described here, such as delay buffering, vectorization, replication, and streaming.…”
Section: Related Work
confidence: 99%
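Two of the transformations named in the excerpt above, vectorization and replication, can be sketched in C. This is a hypothetical illustration (the function name, lengths, and unroll factor U are my assumptions): the replicated accumulators mirror what an HLS unroll plus array-partition transformation produces in hardware, where U multiply-add units operate each cycle.

```c
#include <assert.h>

#define LEN 16   /* vector length, assumed a multiple of the unroll factor */
#define U   4    /* unroll (replication) factor */

/* Dot product with U independent accumulators: in an HLS flow the
   inner loop would be fully unrolled, so the U multiply-adds execute
   in parallel rather than sequentially. */
static float dot_unrolled(const float x[LEN], const float y[LEN]) {
    float acc[U] = {0};
    for (int i = 0; i < LEN; i += U)
        for (int u = 0; u < U; u++)      /* fully unrolled in hardware */
            acc[u] += x[i + u] * y[i + u];
    float sum = 0;
    for (int u = 0; u < U; u++)          /* reduction tree over replicas */
        sum += acc[u];
    return sum;
}
```

Splitting the single loop-carried accumulator into U independent ones is what breaks the dependence chain; without it, the floating-point add latency would prevent the pipeline from accepting one input per cycle.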
“…Much of the previous work focuses on low-level implementation for performance [20], explores high-level optimizations [28], or implements MMM in the context of neural networks [19,29]. To the best of our knowledge, this is the first work to minimize I/O of matrix multiplication on FPGA in terms of hardware constants, and the first work to open-source our implementation for the benefit of the community.…”
Section: Related Work
confidence: 99%
“…The authors derive the required off-chip bandwidth and buffer space required to achieve peak performance on the target device, but do not model or optimize I/O in terms of their buffer space usage, and do not report their tile sizes or how they were chosen. Furthermore, the authors double-buffer the output tile, reducing the maximum achievable computational intensity by a factor…”
[Table residue: comparison of prior FPGA matrix-multiplication implementations — [30] (2004, Virtex-II Pro), Dou [31] (2005, Virtex-II Pro), Kumar [32] (2009, Virtex-5), Jovanović [20] (2012, Virtex-6), D'Hollander [28] (2016); the frequency and resource columns are not recoverable from the extraction.]
Section: Related Work
confidence: 99%
“…One such application is the Fast Fourier Transform (FFT) and other algorithms based on it [22]–[24]. Other areas include neural networks [25], matrix multiplication [26], digital filters [27], [28], communication systems [29] and more.…”
Section: Introduction
confidence: 99%