Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

Cong, Jason; Wei, Peng; Yu, Cody Hao; Zhang, Peng

doi:10.1109/dac.2018.8465940

Cited by 21 publications

(22 citation statements)

References 24 publications


“…Although several works have been proposed [6, 9, 41-43] to automatically explore a large set of FPGA designs, they all assume the initial design to be fully cached or cacheable. For the sake of conciseness, in this Section we only focus on approaches exploring directive-insertion optimizations that, as demonstrated by Siracusa et al. [31], are of particular interest in mixed optimization methodologies.…”

Section: Related Work (mentioning)

confidence: 99%

A CAD-based methodology to optimize HLS code via the roofline model

Siracusa, Tucci, Rabozzi, et al. 2020

Proceedings of the 39th International Conference on Computer-Aided Design



“…General HLS compilers - Beyond generating systolic arrays, there is also a plethora of work targeting implementing general applications on FPGAs [3,6,16,21,34]. However, experimental results show that there still exists a performance gap between such frameworks and dedicated systolic array compilers like SuSy.…”

Section: Related Work (mentioning)

confidence: 99%

SuSy

Lai, Rong, Zheng, et al. 2020

Proceedings of the 39th International Conference on Computer-Aided Design

Self Cite


“…Figure 1 illustrates our new contributions, highlighted with bold and red, relative to the prior HLS literature. There exists many automated approaches for generating device and host interfaces [20,61,83], exploring parallelization opportunities [24,34,46,83…”

Section: HeteroRefactor Workflow (mentioning)

confidence: 99%

“…The kernels we selected are slightly slower than running on CPU because I6 and I7 in Rosetta are designed to achieve higher energy efficiency but not higher processing throughput compared to CPU [84]. HeteroRefactor aims to reduce resource usage, while prior work [19,24] achieves higher performance than CPU by leveraging more on-chip resources to achieve parallelism. HeteroRefactor could be used jointly with other tools to produce fast and resource-efficient FPGA accelerators.…”

Section: Overhead and Performance (mentioning)

confidence: 99%


HeteroRefactor

Lau, Sivaraman, Zhang, et al. 2020

Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering

Self Cite


Heterogeneous computing with field-programmable gate-arrays (FPGAs) has demonstrated orders of magnitude improvement in computing efficiency for many applications. However, the use of such platforms so far is limited to a small subset of programmers with specialized hardware knowledge. High-level synthesis (HLS) tools made significant progress in raising the level of programming abstraction from hardware programming languages to C/C++, but they usually cannot compile and generate accelerators for kernel programs with pointers, memory management, and recursion, and require manual refactoring to make them HLS-compatible. Besides, experts also need to provide heavily handcrafted optimizations to improve resource efficiency, which affects the maximum operating frequency, parallelization, and power efficiency.

We propose a new dynamic invariant analysis and automated refactoring technique, called HeteroRefactor. First, HeteroRefactor monitors FPGA-specific dynamic invariants: the required bitwidth of integer and floating-point variables, and the size of recursive data structures and stacks. Second, using this knowledge of dynamic invariants, it refactors the kernel to make traditionally HLS-incompatible programs synthesizable and to optimize the accelerator's resource usage and frequency further. Third, to guarantee correctness, it selectively offloads the computation from CPU to FPGA, only if an input falls within the dynamic invariant. On average, for a recursive program of size 175 LOC, an expert FPGA programmer would need to write 185 more LOC to implement an HLS-compatible version, while HeteroRefactor automates such transformation. Our results on Xilinx FPGA show that HeteroRefactor minimizes BRAM by 83% and increases frequency by 42% for recursive programs; reduces BRAM by 41% through integer bitwidth reduction; and reduces DSP by 50% through floating-point precision tuning.
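The monitor-then-guard idea in this abstract can be illustrated with a minimal Python sketch. It is not HeteroRefactor's implementation: it covers only the integer-bitwidth invariant, checks only the kernel inputs, and every name in it (`required_bitwidth`, `fits`, `make_guarded_kernel`, and the two stand-in kernels) is hypothetical. The "FPGA" path is an ordinary Python function standing in for a narrowed accelerator.

```python
from typing import Callable, Iterable, List, Tuple


def required_bitwidth(values: Iterable[int]) -> int:
    """Smallest signed two's-complement width holding every observed value."""
    width = 1
    for v in values:
        # bit_length() ignores sign; for negative v, ~v equals -v - 1.
        magnitude_bits = v.bit_length() if v >= 0 else (~v).bit_length()
        width = max(width, magnitude_bits + 1)  # +1 for the sign bit
    return width


def fits(value: int, width: int) -> bool:
    """True if value is representable as a signed integer of `width` bits."""
    return -(1 << (width - 1)) <= value <= (1 << (width - 1)) - 1


def make_guarded_kernel(
    width: int,
    fpga_kernel: Callable[[List[int]], int],  # hypothetical narrowed kernel
    cpu_kernel: Callable[[List[int]], int],   # original full-width kernel
) -> Callable[[List[int]], Tuple[str, int]]:
    """Offload only when every input falls within the profiled invariant."""
    def run(inputs: List[int]) -> Tuple[str, int]:
        if all(fits(x, width) for x in inputs):
            return "fpga", fpga_kernel(inputs)
        return "cpu", cpu_kernel(inputs)  # fall back to keep results correct
    return run


# Profiling phase: observe values on representative inputs (made-up data).
profile = [3, -17, 100]
width = required_bitwidth(profile)

# Both stand-in kernels just sum their inputs.
guarded = make_guarded_kernel(width, sum, sum)
in_range = guarded([1, 2, 3])
out_of_range = guarded([1, 500])
```

With this made-up profile the inferred width is 8 bits, so `[1, 2, 3]` takes the offload path while `[1, 500]` falls back to the CPU kernel.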
