OpenMP Device Offloading to FPGAs Using the Nymble Infrastructure
Published: 2020
DOI: 10.1007/978-3-030-58144-2_17

Cited by 6 publications (4 citation statements)
References: 28 publications
“…This typically leads to very high compile times and very low FPGA occupancy and performance, since CPU- and GPU-optimized code is notably inefficient on FPGA architectures. Further work by Knaust [13] and Huthmann [14] attacks this problem in different ways. The first opts to prototype the FPGA device with OpenCL and compiler-specific interfaces, requiring IR (Intermediate Representation) backporting to make use of the HLS system and the OpenCL interfaces.…”
Section: Related Work (mentioning)
confidence: 99%
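
To picture the OpenCL route this first approach takes, the sketch below shows the kind of host-side OpenCL calls an OpenMP-to-FPGA offload path could issue to run a precompiled kernel. It is a minimal sketch under stated assumptions, not code from [13]: the bitstream file name "kernel.aocx" and the kernel name "target_region_0" are illustrative placeholders for the outlined target region.

/* Hypothetical host-side flow: load a prebuilt FPGA bitstream and obtain a
   kernel handle via standard OpenCL calls. Names are illustrative only. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    /* FPGA boards are typically exposed as CL_DEVICE_TYPE_ACCELERATOR. */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* FPGA flows load a prebuilt bitstream instead of compiling OpenCL C online. */
    FILE *f = fopen("kernel.aocx", "rb");
    fseek(f, 0, SEEK_END);
    size_t len = (size_t)ftell(f);
    rewind(f);
    unsigned char *bin = malloc(len);
    fread(bin, 1, len, f);
    fclose(f);

    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &len,
                                                (const unsigned char **)&bin,
                                                NULL, &err);
    clBuildProgram(prog, 1, &device, "", NULL, NULL);

    /* "target_region_0" stands for the outlined OpenMP target region. */
    cl_kernel kernel = clCreateKernel(prog, "target_region_0", &err);
    /* ... clSetKernelArg / buffer transfers / clEnqueueNDRangeKernel ... */

    clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    free(bin);
    return 0;
}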
“…More flexible than [13,14] is the aforementioned work by Yviquel et al. [10]. It does not generate target binary code but rather a Scala implementation (as a Java runtime binary) to be run on any Apache Spark cluster.…”
Section: Related Work (mentioning)
confidence: 99%
“…The work by Huthmann et al. in [62] presents an approach to OpenMP device offloading for FPGAs based on the LLVM compiler infrastructure and the Nymble HLS compiler. The automatic compilation flow uses LLVM IR for HLS-specific optimizations and transformations and for the interaction with the Nymble HLS compiler.…”
Section: Related Work (mentioning)
confidence: 99%
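
As an illustration of what such a flow consumes (this is a generic example, not one taken from the cited paper), a plain OpenMP target region like the one below is outlined by the compiler, lowered through LLVM IR, and handed to the HLS back end; the map clauses describe the host-to-FPGA data movement.

/* Generic OpenMP target region; everything inside the target construct is a
   candidate for HLS-generated hardware. Illustrative example only. */
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    #pragma omp target map(to: a[0:N], b[0:N]) map(from: c[0:N])
    {
        for (int i = 0; i < N; ++i)
            c[i] = a[i] + b[i];
    }

    printf("c[42] = %f\n", c[42]);
    return 0;
}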
“…The authors in [62] argue that scaling OpenMP onto multiple FPGAs is an open question. They suggest that one could rely on OpenMP's accelerator directives and treat each device as a discrete system with little to no access to other systems, creating special hardware to (for example) support a shared-memory view across multiple FPGAs, or use tasks as containers that encapsulate produced/consumed data and are exchanged among FPGAs.…”
Section: Chapter 6 Final Remarks and Conclusion (mentioning)
confidence: 99%
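
A minimal sketch, assuming two FPGAs exposed as OpenMP devices 0 and 1, of how those two suggestions could look at the directive level: each target region addresses one device via device(n), and depend clauses make the deferred target tasks behave as containers whose produced/consumed buffer (mid) orders the exchange. The example is illustrative only, not code from the cited work.

/* Hypothetical two-stage pipeline across two OpenMP devices. */
#include <stdio.h>

#define N 1024

int main(void) {
    float in[N], mid[N], out[N];
    for (int i = 0; i < N; ++i) in[i] = (float)i;

    /* Stage 1 on device 0; the deferred target task produces mid. */
    #pragma omp target nowait device(0) \
            map(to: in[0:N]) map(from: mid[0:N]) depend(out: mid)
    for (int i = 0; i < N; ++i) mid[i] = in[i] * 2.0f;

    /* Stage 2 on device 1 consumes mid; the depend clauses order the stages. */
    #pragma omp target nowait device(1) \
            map(to: mid[0:N]) map(from: out[0:N]) depend(in: mid)
    for (int i = 0; i < N; ++i) out[i] = mid[i] + 1.0f;

    /* Wait for both deferred target tasks before using the result on the host. */
    #pragma omp taskwait
    printf("out[10] = %f\n", out[10]);
    return 0;
}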