2009 IEEE 7th Symposium on Application Specific Processors 2009
DOI: 10.1109/sasp.2009.5226333

FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs

Abstract: As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs a…

Citations: cited by 126 publications (73 citation statements)
References: 11 publications
“…Overall, VThreads with its hardware-assisted PThreads support demonstrates better performance to the Leon3MP system particularly for highly-parallel workloads such as Mandelbrot and Sobel filter; the pattern on more complex benchmarks such as JPEG decode and DES is slightly different with the Leon3MP system demonstrating better performance at higher core counts (8). This can be attributed to the very fast custom PThreads implementation in which all cores are active in a very tight polling loop in a shared-memory system whereas in VThreads, the DBG_IF PThreads mechanism in Section 3.6 is a natural synchronization point which can be further optimized if implemented in a pipelined fashion such as the MPI coprocessors of [31].…”
Section: Discussion Of Results (mentioning)
confidence: 98%
“…Thus the ML-GPS framework can efficiently complete the design space exploration within minutes (rather than days if synthesis and physical implementation were used). More importantly, the design space point selected by the ML-GPS search is shown to provide up to 7X of speedup with relation to previous work [15], while achieving near optimal performance.…”
Section: Introduction (mentioning)
confidence: 93%
“…The ML-GPS framework is based on the FCUDA framework [15] (referred to as SL-GPS hereafter) which demonstrates a novel HLS-based flow for mapping coarse-grained parallelism in CUDA kernels onto spatial parallelism on reconfigurable fabric. The SPMD CUDA kernels offer a concise way for describing work to be done by multiple threads which are organized in groups called thread-blocks.…”
Section: Background and Motivation (mentioning)
confidence: 99%
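
The SPMD model mentioned in the statement above (one kernel body run by many threads, grouped into thread-blocks) can be illustrated with a small serial emulation. This is only a sketch in plain Python, not CUDA and not FCUDA output; `vector_add_kernel` and `launch` are hypothetical names chosen for illustration, and the 1-D grid and vector-add workload are assumptions.

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Body run by one logical thread (SPMD: same code, different indices)."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(out):                        # guard for the partly filled last block
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Serially emulate a 1-D kernel launch (hypothetical helper)."""
    for block_idx in range(grid_dim):        # thread-blocks: coarse-grained units
        for thread_idx in range(block_dim):  # threads inside one thread-block
            kernel(block_idx, thread_idx, block_dim, *args)


n, block_dim = 10, 4
grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim) thread-blocks
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
```

In this example no iteration of the outer loop depends on another, so each thread-block could in principle be handled by a separate compute unit; that independence is the kind of coarse-grained parallelism the citing text describes being mapped onto spatial parallelism.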