FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs

Papakonstantinou, Alexandros; Gururaj, Karthik; Stratton, John A.; Chen, Deming; Cong, Jason; Hwu, Wen-mei W.

doi:10.1109/sasp.2009.5226333

Cited by 126 publications

(73 citation statements)

References 11 publications

“…Overall, VThreads with its hardware-assisted PThreads support demonstrates better performance than the Leon3MP system, particularly for highly parallel workloads such as Mandelbrot and the Sobel filter; the pattern on more complex benchmarks such as JPEG decode and DES is slightly different, with the Leon3MP system demonstrating better performance at higher core counts (8). This can be attributed to the very fast custom PThreads implementation in which all cores are active in a very tight polling loop in a shared-memory system, whereas in VThreads the DBG_IF PThreads mechanism in Section 3.6 is a natural synchronization point which can be further optimized if implemented in a pipelined fashion such as the MPI coprocessors of [31].…”

Section: Discussion Of Results (mentioning)

confidence: 98%

VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

Chouliaras; Stevens; Dwyer. Microprocessors and Microsystems, 2016.

performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores, VThreads demonstrates a post-route (statistical) power reduction between 65% and 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area makes the processor an attractive proposition for low-power, deeply embedded applications requiring minimum OS support.

“…Thus the ML-GPS framework can efficiently complete the design space exploration within minutes (rather than days if synthesis and physical implementation were used). More importantly, the design space point selected by the ML-GPS search is shown to provide up to 7X of speedup with relation to previous work [15], while achieving near optimal performance.…”

Section: Introduction (mentioning)

confidence: 93%

“…The ML-GPS framework is based on the FCUDA framework [15] (referred to as SL-GPS hereafter) which demonstrates a novel HLS-based flow for mapping coarse-grained parallelism in CUDA kernels onto spatial parallelism on reconfigurable fabric. The SPMD CUDA kernels offer a concise way for describing work to be done by multiple threads which are organized in groups called thread-blocks.…”

Section: Background and Motivation (mentioning)

confidence: 99%

“…Some of these programming models, such as OpenMP, streaming languages, etc., have been adopted as programming interfaces for mapping application parallelism onto FPGA [12][13][14]. Moreover, the recently introduced CUDA (Compute Unified Device Architecture) [8] programming model by NVIDIA which provides a multi-threaded SPMD model for general purpose computing on GPUs has been selected as the FPGA programming model in the FCUDA framework [15].…”

Section: Introduction (mentioning)

confidence: 99%

“…However, in most cases, application parallelism is extracted only from a single level of granularity (e.g. loop [13,6,7], stream pipeline [14] or procedure granularity [15]). Moreover, the impact of additional parallelism on frequency is either ignored (only cycles are reported) or dealt with via worst case synthesis conditions (i.e.…”

Section: Introduction (mentioning)

confidence: 99%

Multilevel Granularity Parallelism Synthesis on FPGAs

Papakonstantinou; Liang; Stratton; et al. 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011.

Self Cite

Abstract: Recent progress in High-Level Synthesis (HLS) techniques has helped raise the abstraction level of FPGA programming. However, implementation and performance evaluation of the HLS-generated RTL involves lengthy logic synthesis and physical design flows. Moreover, mapping of different levels of coarse-grained parallelism onto hardware spatial parallelism affects the final FPGA-based performance both in terms of cycles and frequency. Evaluation of the rich design space through the full implementation flow (starting with high-level source code and ending with a routed netlist) is prohibitive in various scientific and computing domains, thus hindering the adoption of reconfigurable computing. This work presents a framework for multilevel granularity parallelism exploration with HLS-order of efficiency. Our framework considers different granularities of parallelism for mapping CUDA kernels onto high-performance FPGA-based accelerators. We leverage resource and clock period models to estimate the impact of multi-granularity parallelism extraction on execution cycles and frequency. The proposed Multilevel Granularity Parallelism Synthesis (ML-GPS) framework employs an efficient design space search heuristic in tandem with the estimation models as well as design layout information to derive a performance near-optimal configuration. Our experimental results demonstrate that ML-GPS can efficiently identify and generate CUDA kernel configurations that can significantly outperform previous related tools, while offering competitive performance compared to software kernel execution on GPUs at a fraction of the energy cost.

Reconfigurable Computing

Boland; Chung-Kuan; Kahng; et al. Wiley Encyclopedia of Electrical and Electronics Engineering, 2017.

Reconfigurable computing is the application of adaptable fabrics to address computational problems, often taking advantage of the flexibility of field-programmable gate arrays (FPGAs) to produce problem-specific solutions. It has been successfully applied to fields as diverse as machine learning, digital signal processing, cryptography, bioinformatics, logic emulation, CAD tool acceleration, scientific computing, and rapid prototyping. In this article, intended for the nonspecialist, we describe some of the basic concepts, tools, and architectures associated with reconfigurable computing.
