A Highly-Efficient and Tightly-Connected Many-Core Overlay Architecture

Abdelhamid, Riadh Ben; Yamaguchi, Yoshiki; Boku, Taisuke

doi:10.1109/access.2021.3074171

Cited by 11 publications

(3 citation statements)

References 14 publications

“…2GRVI Phalanx [59] extends that to more than 1000 64-bit RISC-V cores on a Xilinx VU37P. The DRAGON architecture [60] is a 64-bit custom-ISA cluster-based multiprocessor that scales to 144 cores on a Xilinx VU37P. In contrast, the accelerator in HEROv2 is not specialized for FPGAs but has identical RTL code as for ASIC tapeouts.…”

Section: Related Work (mentioning)

confidence: 99%

HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing

Kurth, Forsberg, Benini

2022

Preprint

Heterogeneous computers integrate general-purpose host processors with domain-specific accelerators to combine versatility with efficiency and high performance. To realize the full potential of heterogeneous computers, however, many hardware and software design challenges have to be overcome. While architectural and system simulators can be used to analyze heterogeneous computers, they are faced with unavoidable compromises between simulation speed and performance modeling accuracy. In this work we present HEROv2, an FPGA-based research platform that enables accurate and fast exploration of heterogeneous computers consisting of accelerators based on clusters of 32-bit RISC-V cores and an application-class 64-bit ARMv8 or RV64 host processor. HEROv2 enables seamless data sharing between 64-bit hosts and 32-bit accelerators and comes with a fully open-source on-chip network, a unified heterogeneous programming interface, and a mixed-data-model, mixed-ISA heterogeneous compiler based on LLVM. We evaluate HEROv2 in four case studies from the application level over toolchain and system architecture down to accelerator microarchitecture. We demonstrate how HEROv2 enables effective research and development on the full stack of heterogeneous computing. For instance, the compiler can tile loops and infer data transfers to and from the accelerators, which leads to a speedup of up to 4.4× compared to the original program and in most cases is only 15% slower than a handwritten implementation, which requires 2.6× more code.

“…The survey done in [8] highlights the fact that many overlay tools have been developed in both classes; we can mention DeCO [9] for SC overlays; GRVI Phalanx [10], and reMORPH [11] for TM overlays. The work done in [12] highlights a list of some previous parallel processing overlays.…”

Section: Introduction (mentioning)

confidence: 99%

“…All of the tools presented in [12] integrate parallel computing models; however, we have not encountered an FPGA overlay tool that addresses the design of MPI parallel applications without the intervention of CPUs. MPI parallelization compared to OpenMP has advantages of no parallelization overhead, except for the explicit communications that have been added to the program once the MPI parallel program has been configured; moreover, all aspects of MPI programs are generally executed in parallel, unlike OpenMP [13].…”

Section: Introduction (mentioning)

confidence: 99%

An efficient FPGA overlay for MPI-2 RMA parallel applications

Leonel, Ewo, Denoulet, et al.

2022

2022 20th IEEE Interregional NEWCAS Conference (NEWCAS)

Design productivity issues, including difficult hardware design and long compile times, are major barriers to the widespread adoption of FPGA-based acceleration in mainstream computing. Enabling virtualized execution of software and hardware tasks on FPGA platforms would make them more accessible to application developers accustomed to software API abstractions such as MPI and fast development cycles. In this work, we show that the MATIP platform provides a viable and efficient FPGA overlay architecture for the design of MPI parallel applications. We support this with a parallel model implementation of a feature extraction algorithm for tone language recognition, which is shown to be at least 7 times more efficient than a C++ MPI-2 RMA implementation of the same parallel model on a CPU and almost 3 times more efficient than a naive FPGA IP implementation.
