Karthik Gururaj scite author profile

Abstract-As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is often not a push-button task. Often the programmer has to expose the application's fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPGPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

show abstract

Accelerator-Rich Architectures

Cong

Ghodrat

Gill

et al. 2014

View full text Add to dashboard Cite

Multilevel Granularity Parallelism Synthesis on FPGAs

Papakonstantinou

Liang

Stratton

et al. 2011

View full text Add to dashboard Cite

Abstract-Recent progress in High-Level Synthesis (HLS) techniques has helped raise the abstraction level of FPGA programming. However implementation and performance evaluation of the HLS-generated RTL, involves lengthy logic synthesis and physical design flows. Moreover, mapping of different levels of coarse grained parallelism onto hardware spatial parallelism affects the final FPGA-based performance both in terms of cycles and frequency. Evaluation of the rich design space through the full implementation flow -starting with high level source code and ending with routed netlist -is prohibitive in various scientific and computing domains, thus hindering the adoption of reconfigurable computing. This work presents a framework for multilevel granularity parallelism exploration with HLS-order of efficiency. Our framework considers different granularities of parallelism for mapping CUDA kernels onto high performance FPGA-based accelerators. We leverage resource and clock period models to estimate the impact of multi-granularity parallelism extraction on execution cycles and frequency. The proposed Multilevel Granularity Parallelism Synthesis (ML-GPS) framework employs an efficient design space search heuristic in tandem with the estimation models as well as design layout information to derive a performance near-optimal configuration. Our experimental results demonstrate that ML-GPS can efficiently identify and generate CUDA kernel configurations that can significantly outperform previous related tools whereas it can offer competitive performance compared to software kernel execution on GPUs at a fraction of the energy cost.

show abstract

Synthesis of reconfigurable high-performance multicore systems

Cong

Gururaj

Han

2009

View full text Add to dashboard Cite

Reconfigurable high-performance computing systems (RHPC) have been attracting more and more attention over the past few years. RHPC systems are a promising solution for accelerating system performance, lowering power consumption and minimizing operation cost. In order to achieve high performance on this hybrid system, it is important to effectively explore the design space, which includes accelerator synthesis, resource allocation and job scheduling. In this paper we propose novel algorithms for reconfigurable resource allocation and job scheduling to optimize performance of multicore RHPC systems. Specifically, we first propose an interesting approximation algorithm to assign jobs to processors with consideration of coprocessors at the global optimization step. Then we present an optimal solution for coprocessor selection in the local optimization step. In this paper we also demonstrate that designers can quickly explore a large number of accelerator design choices with the help of high-level synthesis tools. Experiments show that our proposed techniques provide efficient solutions for real-life benchmarks and generate higher quality of results. When compared to other heuristic algorithms, our results can achieve up to 47% performance improvement.

show abstract

Synthesis Algorithm for Application-Specific Homogeneous Processor Networks

Cong

Gururaj

Han

2009

IEEE Trans. VLSI Syst.

View full text Add to dashboard Cite

Abstract-The application-specific multiprocessor system-on-achip is a promising design alternative because of its high degree of flexibility, short development time, and potentially high performance attributed to application-specific optimizations. However, designing an optimal application-specific multiprocessor system is still challenging because there are a number of important metrics, such as throughput, latency, and resource usage, which need to be explored and optimized. This paper addresses the problem of synthesizing an application-specific multiprocessor system for stream-oriented embedded applications to minimize system latency under the throughput constraint. We employ a novel framework for this problem, similar to that of technology mapping in the logic synthesis domain, and develop a set of efficient algorithms, including labeling and clustering for efficient generation of the multiprocessor architecture with application-specific optimized latency. Specifically, the result of our algorithm is latency-optimal for directed acyclic task graphs. Application of our approach to the Motion JPEG example on Xilinx's Virtex II Pro platform FPGA shows interesting design tradeoffs.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Karthik Gururaj

FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs

Accelerator-Rich Architectures

Multilevel Granularity Parallelism Synthesis on FPGAs

Synthesis of reconfigurable high-performance multicore systems

Synthesis Algorithm for Application-Specific Homogeneous Processor Networks

Contact Info

Product

Resources

About