RapidWright: Enabling Custom Crafted Implementations for FPGAs

Lavin, Christopher; Kaviani, Alireza

doi:10.1109/fccm.2018.00030

Cited by 82 publications

(29 citation statements)

References 11 publications

“…The Read A and Transpose modules are connected with a series of FIFOs, the number of which is determined by the desired memory efficiency in reading A from DRAM. In our provided implementation, PEs are connected in a 1D sequence, and can thus be routed across the FPGA in a "snakelike" fashion [16] to maximize resource utilization with minimum routing constraints introduced by the module interconnect. The PE architecture is shown in Fig.…”

Section: Final Module Layout

mentioning

confidence: 99%
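The "snakelike" routing of a 1D chain of PEs mentioned in the statement above can be illustrated with a minimal sketch. It assumes an idealized rectangular grid of PE sites; the function names (`snake_placement`, `row_major_placement`, `max_neighbor_distance`) and the grid model are hypothetical and are not taken from the cited paper or from RapidWright.

```python
def snake_placement(num_pes, columns):
    """Place a 1D chain of PEs on a grid, reversing direction on every other row."""
    if columns <= 0:
        raise ValueError("columns must be positive")
    coords = []
    for i in range(num_pes):
        row, offset = divmod(i, columns)
        # Even rows run left to right, odd rows run right to left.
        col = offset if row % 2 == 0 else columns - 1 - offset
        coords.append((row, col))
    return coords


def row_major_placement(num_pes, columns):
    """Baseline for comparison: every row runs left to right."""
    if columns <= 0:
        raise ValueError("columns must be positive")
    return [divmod(i, columns) for i in range(num_pes)]


def max_neighbor_distance(coords):
    """Largest Manhattan distance between consecutive PEs in the chain."""
    return max(
        (abs(r1 - r0) + abs(c1 - c0)
         for (r0, c0), (r1, c1) in zip(coords, coords[1:])),
        default=0,
    )


if __name__ == "__main__":
    snake = snake_placement(12, 4)
    plain = row_major_placement(12, 4)
    print("snake:", max_neighbor_distance(snake))
    print("row-major:", max_neighbor_distance(plain))
```

In this idealized model, consecutive PEs in the snake order are always grid neighbors, whereas a row-major order needs a long wire at the end of every row. Real devices have irregular columns and clock regions, which this sketch ignores.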

“…They often rely on abstracting away many hardware details, assuming several idealized processing units with local memory and all-to-all communication [2,5,8,9]. Those assumptions do not hold for FPGAs, where the physical area size of custom-designed processing elements (PEs) and their layout are among the most important concerns in designing efficient FPGA implementations [16]. Therefore, performance modeling for reconfigurable architectures requires taking constraints like logic resources, fan-out, routing, and on-chip memory characteristics into account. With an ever-increasing diversity in available hardware platforms, and as low-precision arithmetic and exotic data types are becoming key in modern DNN [17] and linear solver [18] applications, extensibility and flexibility of hardware architectures will be crucial to stay competitive.…”

mentioning

confidence: 99%

Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis

Licht

Kwasniewski

Hoefler

2020

Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception, and lower bounds have been proven and implemented both for shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing algorithms, as they offer full control of memory accesses to the programmer. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimize algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement, within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool, maintaining a high level of abstraction, which allows us to support arbitrary data types and enables maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open source project to encourage the open development of linear algebra and I/O minimizing algorithms on reconfigurable hardware platforms.

[Figure 1: (a) MMM CDAG, and (b) subcomputation V_i.]

[…] yields fully deterministic behavior in the circuit: accessing memory, both on-chip and off-chip, is always done explicitly, rather than by a cache replacement scheme fixed by the hardware. The models established so far, however, pose a challenge for their applicability on FPGAs. They often rely on abstracting away many hardware details, assuming several idealized processing units with local memory and all-to-all communication [2,5,8,9]. Those assumptions do not hold for FPGAs, where the physical area size of custom-designed processing elements (PEs) and their layout are among the most important concerns in designing efficient FPGA implementations [16]. Therefore, performance modeling for reconfigurable architectures requires taking constraints like logic resources, fan-out, routing, and on-chip memory characteristics into account. With an ever-increasing diversity in available hardware platforms, and as low-precision arithmetic and exotic data types are becoming key in modern DNN [17] and linear solver [18] applications, extensibility and flexibility of hardware architectures will be crucial to stay competitive. Existing high-performance FPGA implementations [19,20] are implemented in hardware description languages (HDLs), which drastically constrains their maintenance, reuse, generalizability, and portability. Furthermore, the source code is not disclosed, such that third-party users cannot benefit from the kernel or build on the archi...

“…Tools that fall into this category include RapidSmith [8] and Torc [9], which interface with Xilinx ISE, as well as RapidSmith2 [2] and RapidWright [10], which both interface with Vivado.…”

Section: Third-party CAD Tools and Related Work

mentioning

confidence: 99%

“…Verilog-to-Routing (VTR) [7] is one example of an alternative CAD suite and has been commonly used for experimentation on hypothetical FPGA architectures. Tools which can target Xilinx FPGAs have also been developed, such as RapidSmith [8], Torc [9], RapidSmith2 [2], and RapidWright [10]. However, these tools have traditionally returned to the vendor tools at least to generate final bitstreams.…”

mentioning

confidence: 99%

Maverick: A Stand-Alone CAD Flow for Partially Reconfigurable FPGA Modules

Glick

Grigg

Nelson

et al.

2019

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Circuit designs for field-programmable gate arrays (FPGAs) are typically compiled by FPGA vendor tools, such as Xilinx's Vivado Design Suite. In recent years, partial reconfiguration (PR) has emerged as a popular technique that allows portions of an FPGA to be dynamically reconfigured after the complete device has been configured with an initial bitstream. However, the nature of current FPGA vendor tools limits further innovation and possible usage models of PR. This thesis presents Maverick, an open-source proof-of-concept computer-aided design (CAD) flow for generating reconfigurable modules (RMs) which target PR regions in FPGA designs. Maverick builds upon existing open source tools (Yosys [1], RapidSmith2 [2], and Project X-Ray [3]) to form an end-to-end compilation flow. After an initial static design and PR region are created with Xilinx's Vivado PR flow, Maverick can then compile and configure RMs onto that PR region, without the use of vendor tools. In addition, this work enables users to import and export RMs between Vivado and RapidSmith2. Furthermore, this thesis demonstrates Maverick compiling RMs on both a desktop computer and on the embedded PYNQ-Z1 board, which contains a Zynq 7020 system on chip (SoC). Maverick runs on the ARM processor embedded within the processing system (PS) of the Zynq device, generating partial bitstreams which can then be configured onto a PR region within the programmable logic (PL) fabric of the same Zynq device. This unique case, not possible with current vendor tools like Vivado, demonstrates the feasibility of a single-chip embedded system which can both compile HDL designs to bitstreams and then configure them onto its own programmable fabric.

“…This so-called register recycling significantly reduces the number of inserted registers. The optimized results are then implemented by inserting the pipeline registers using RapidWright [7]. RapidWright is an open-source, Java-based framework from Xilinx, which allows users to access lower-level architecture details and make netlist-level manipulations using high-level Java programming.…”

Section: Introduction

mentioning

confidence: 99%

Optimizing FPGA-Based Streaming Applications for Throughput Using Pipelining

Ali

Loo

Kruiper

et al.

2019

2019 International Conference on Field-Programmable Technology (ICFPT)

In this paper, we present an automated flow for insertion of pipeline stages in FPGA-based streaming applications in order to increase the throughput. The proposed approach involves the utilization of Xilinx's Automated Pipeline Analysis tool to estimate the number of pipeline stages, while the RapidWright framework incorporates these stages into a synthesized design. The Vivado Design Suite is then used to place and route the modified netlist. Furthermore, a recycling approach has also been proposed to reduce excess registers. The results show a significant improvement in the maximum operating frequency for designs without any sequential loops (~51%) with a moderate resource overhead, while slight gains (~12%) were also observed for designs containing feedback loops.
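Why inserting pipeline stages raises the maximum operating frequency can be shown with a small, idealized model. This is a sketch under stated assumptions, not the flow from the abstract above: it does not model Xilinx's pipeline analysis tool or the RapidWright API. It assumes one feed-forward path given as a list of combinational delays in picoseconds, ignores register setup and clock-to-output times, and uses hypothetical names (`stages_needed`, `min_max_stage_delay`, `fmax_mhz`).

```python
def stages_needed(delays_ps, limit_ps):
    """Greedy count of contiguous stages so that no stage exceeds limit_ps."""
    stages, current = 1, 0
    for d in delays_ps:
        if d > limit_ps:
            raise ValueError("a single element exceeds the stage limit")
        if current + d > limit_ps:
            # Close the current stage (a register goes here) and start a new one.
            stages += 1
            current = d
        else:
            current += d
    return stages


def min_max_stage_delay(delays_ps, stages):
    """Smallest worst-case stage delay using at most `stages` contiguous stages."""
    if stages < 1 or not delays_ps:
        raise ValueError("need at least one stage and one delay")
    lo, hi = max(delays_ps), sum(delays_ps)
    # Binary search on the clock period; integer picoseconds keep it exact.
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(delays_ps, mid) <= stages:
            hi = mid
        else:
            lo = mid + 1
    return lo


def fmax_mhz(period_ps):
    """Clock frequency in MHz for a period given in picoseconds."""
    return 1e6 / period_ps


if __name__ == "__main__":
    path = [400, 900, 700, 500, 600]  # hypothetical delays, in ps
    for n in (1, 2, 3):
        period = min_max_stage_delay(path, n)
        print(n, "stage(s):", period, "ps,", round(fmax_mhz(period), 1), "MHz")
```

In this model the clock period is set by the slowest stage, so each added register boundary can only shorten it, with diminishing returns once the largest single element dominates. Designs with feedback loops do not fit this feed-forward model, which is consistent with the smaller gains the abstract reports for them.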
