A Hardware Runtime for Task-Based Programming Models

Tan, Xubin; Bosch, Jaume; Martínez, Carlos Álvarez; Jiménez-González, Daniel; Ayguadé, Eduard; Valero, Mateo

doi:10.1109/tpds.2019.2907493

Cited by 16 publications

(26 citation statements)

References 22 publications


“…By encapsulating the loop body and loads/stores in different functions, the HLS compiler is able to schedule calls without dependencies in the same cycle. In the example of listing 3, the first loadStore function call of vectorAddTransformed is scheduled alongside the first loopBody call. The other two calls are also scheduled together after the first two.…”

Listing 3: Proposal of OmpSs pragma syntax (vectorAdd) and generated Vivado HLS code (vectorAddTransformed) to pipeline loads/stores with computation

    [8], a2 [8]; float b1 [8], b2 [8]; float c1 [8], c2 [8];
    load(a1,b1,a,b);
    //n is multiple of 2 and n >= 2
    for (int k = 0; k < n-2; ++k) {
      loadStore(a2,b2,c2,a+k * 8,b+k * 8,c+(k-1) * 8,k);
      loopBody(a1,b1,c1);
      ++k;
      loadStore(a1,b1,c1,a+k * 8,b+k * 8,c+(k-1) * 8,k);
      loopBody(a2,b2,c2);
    }
    int k = n-1;
    loadStore(a2,b2,c2,a+k * 8,b+k * 8,c+(k-1) * 8,k);
    loopBody(a1,b1,c1);
    store(c1,c+k * 8);
    loopBody(a2,b2,c2);
    store(c1,c+k * 8);
    }

Section: Compiler Transformations (mentioning)

confidence: 99%
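The quoted statement describes overlapping loads/stores with computation by alternating between two sets of buffers. The following is a minimal software sketch of that idea, not the code generated by the compiler: the function names (`vector_add_plain`, `vector_add_double_buffered`) and the block size of 8 are assumptions chosen for illustration. Python runs the steps one after another, so the sketch only shows the ordering that removes dependencies between a load/store step and a compute step; in an HLS design those two steps could be scheduled together.

```python
def vector_add_plain(a, b):
    """Reference result: element-wise sum of two equal-length lists."""
    return [x + y for x, y in zip(a, b)]


def vector_add_double_buffered(a, b, block=8):
    """Block-wise vector add that prefetches block k+1 while block k is computed.

    Hypothetical illustration: `block` plays the role of the on-chip buffer size.
    """
    if len(a) != len(b) or len(a) % block != 0 or not a:
        raise ValueError("inputs must be equal-length, non-empty multiples of block")
    n_blocks = len(a) // block
    c = [0] * len(a)

    def load(k):
        lo = k * block
        return a[lo:lo + block], b[lo:lo + block]

    def loop_body(buffers):
        a_buf, b_buf = buffers
        return [x + y for x, y in zip(a_buf, b_buf)]

    def store(c_buf, k):
        c[k * block:(k + 1) * block] = c_buf

    current = load(0)      # prologue: fill the first buffer set
    pending = None         # (result, block index) computed but not yet stored
    for k in range(n_blocks):
        # Load/store step: touches only the "other" buffers, so it does not
        # depend on the compute step below.
        next_buffers = load(k + 1) if k + 1 < n_blocks else None
        if pending is not None:
            store(*pending)
        # Compute step: works on the buffers loaded in the previous iteration.
        pending = (loop_body(current), k)
        current = next_buffers
    store(*pending)        # epilogue: write back the last block
    return c


if __name__ == "__main__":
    a = list(range(32))
    b = list(range(100, 132))
    print(vector_add_double_buffered(a, b) == vector_add_plain(a, b))
```

The prologue and epilogue mirror the standalone `load` before the loop and the trailing `store` calls after it in the quoted listing.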

OmpSs@FPGA framework for high performance FPGA computing

Haro, Bosch, Filgueras, et al. 2021. IEEE Trans. Comput.

Self Cite


This paper presents the new features of the OmpSs@FPGA framework. OmpSs is a data-flow programming model that supports task nesting and dependencies to target asynchronous parallelism and heterogeneity. OmpSs@FPGA is the extension of the programming model aimed specifically at FPGAs. The OmpSs environment is built on top of the Mercurium source-to-source compiler and the Nanos++ runtime system. To address FPGA specifics, the Mercurium compiler implements several FPGA-related features, such as local variable caching, wide memory accesses, and accelerator replication. In addition, part of the Nanos++ runtime has been ported to hardware. Driven by the compiler, this new hardware runtime adds new features to FPGA codes, such as task creation and dependence management, providing both performance increases and ease of programming. To demonstrate these new capabilities, different high-performance benchmarks have been evaluated on different FPGA platforms using the OmpSs programming model. The results demonstrate that programs that use the OmpSs programming model achieve very competitive performance with low to moderate porting effort compared to other FPGA implementations.


“…Picos [18,20,24] is the module responsible for providing fast Task Scheduling functionality. Its communication interface includes queues for (1) receiving information about new tasks to be added to the task graph, called submission queue;…”

Section: Picos (mentioning)

confidence: 99%
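The quoted statement describes a queue-based interface for submitting new tasks to a task graph. The sketch below is a toy software model of such an interface, not the Picos hardware design. The submission queue comes from the quote; the `ready` and `retirement` queues, the class name `TaskSchedulerModel`, and the address-based dependence rules are assumptions added for illustration.

```python
from collections import deque


class TaskSchedulerModel:
    """Toy model: tasks enter via a submission queue and leave via a ready queue."""

    def __init__(self):
        self.submission = deque()   # new tasks: (task_id, inputs, outputs)
        self.ready = deque()        # assumed queue: tasks with no unfinished predecessors
        self.retirement = deque()   # assumed queue: tasks reported as finished
        self._last_writer = {}      # address -> most recent task writing it
        self._readers = {}          # address -> tasks reading it since the last write
        self._pending = {}          # task_id -> number of unfinished predecessors
        self._successors = {}       # task_id -> tasks waiting on it
        self._done = set()          # finished task ids

    def submit(self, task_id, inputs=(), outputs=()):
        self.submission.append((task_id, tuple(inputs), tuple(outputs)))

    def process_submissions(self):
        """Drain the submission queue and insert each task into the task graph."""
        while self.submission:
            task_id, inputs, outputs = self.submission.popleft()
            preds = set()
            for addr in inputs:                       # read-after-write
                if addr in self._last_writer:
                    preds.add(self._last_writer[addr])
            for addr in outputs:                      # write-after-write, write-after-read
                if addr in self._last_writer:
                    preds.add(self._last_writer[addr])
                preds.update(self._readers.get(addr, ()))
            preds = {p for p in preds if p != task_id and p not in self._done}
            for addr in inputs:
                self._readers.setdefault(addr, set()).add(task_id)
            for addr in outputs:
                self._last_writer[addr] = task_id
                self._readers[addr] = set()
            self._pending[task_id] = len(preds)
            for p in preds:
                self._successors.setdefault(p, []).append(task_id)
            if not preds:
                self.ready.append(task_id)

    def retire(self, task_id):
        self.retirement.append(task_id)

    def process_retirements(self):
        """Drain the retirement queue and release tasks whose predecessors finished."""
        while self.retirement:
            task_id = self.retirement.popleft()
            self._done.add(task_id)
            for succ in self._successors.pop(task_id, []):
                self._pending[succ] -= 1
                if self._pending[succ] == 0:
                    self.ready.append(succ)


if __name__ == "__main__":
    sched = TaskSchedulerModel()
    sched.submit("t0", outputs=["x"])
    sched.submit("t1", inputs=["x"], outputs=["y"])
    sched.process_submissions()
    print(list(sched.ready))  # only t0 has no predecessors at this point
```

A worker would pop a task from `ready`, run it, and call `retire`. Keeping dependence inference behind queues is what allows that logic to be moved into a separate module.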

“…As a result, several research groups have sought to improve the maximum throughput of Task Scheduling systems by resorting to hardware acceleration, leading to largely successful designs [8,18,20,24]. For example, the Picos [20] Task Scheduling accelerator was proven capable of significantly improving the performance of task parallel programs.…”

Section: Introduction (mentioning)

confidence: 99%

Adding Tightly-Integrated Task Scheduling Acceleration to a RISC-V Multi-core Processor

Morais, Silva, Goldman, et al. 2019. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

Self Cite


Task Parallelism is a parallel programming model that provides code annotation constructs to outline tasks and describe how their pointer parameters are accessed, so that they can be executed in parallel, and asynchronously, by a runtime capable of inferring and honoring their data dependence relationships. It is supported by several parallelization frameworks, such as OpenMP and StarSs. Overhead related to automatic dependence inference and to the scheduling of ready-to-run tasks is a major performance-limiting factor of Task Parallel systems. To amortize this overhead, programmers usually trade the higher parallelism that could be leveraged from finer-grained work partitions for the higher runtime efficiency of coarser-grained work partitions. Such problems are even more severe for systems with many cores, as the task spawning frequency required to keep cores from starving grows linearly with their number. To mitigate these problems, researchers have designed hardware accelerators to improve runtime performance. Nevertheless, the high CPU-accelerator communication overheads of these solutions hampered their gains. We thus propose a RISC-V based architecture that minimizes communication overhead between the HW Task Scheduler and the CPU by allowing Task Scheduling software to interact directly with the former through custom instructions. Empirical evaluation of the architecture is made possible by an FPGA prototype featuring an eight-core Linux-capable Rocket Chip implementing such instructions. To evaluate the prototype performance, we both (1) adapted Nanos, a mature Task Scheduling runtime, to benefit from the new task-scheduling-accelerating instructions; and (2) developed Phentos, a new HW-accelerated lightweight Task Scheduling runtime. Our experiments show that task parallel programs using Nanos-RV (the Nanos version ported to our system) are on average 2.13 times faster than those being serviced by baseline Nanos, while programs running on Phentos are 13.19 times faster, considering geometric means. Using eight cores, Nanos-RV is able to deliver speedups with respect to serial execution of up to 5.62 times, while Phentos produces speedups of up to 5.72 times.
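The abstract reports average speedups as geometric means. As a brief illustration of how that kind of average is computed, here is a minimal sketch. The helper name `geometric_mean` and the sample values are made up for the example and are not measurements from the paper.

```python
import math


def geometric_mean(speedups):
    """Geometric mean of positive ratios, computed via the mean of their logarithms."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


if __name__ == "__main__":
    # Hypothetical per-benchmark speedups, chosen only to show the calculation.
    hypothetical_speedups = [2.0, 8.0]
    print(geometric_mean(hypothetical_speedups))
    print(sum(hypothetical_speedups) / len(hypothetical_speedups))  # arithmetic mean, for contrast
```

The geometric mean is the usual choice for averaging ratios such as speedups because a single large ratio pulls it up less than it pulls up the arithmetic mean.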


“…More details about Picos might be found in related publications [Tan, 2018, Tan et al, 2017, Yazdanpanah et al, 2015.…”

Section: Picos (mentioning)

confidence: 99%

“…As a result, several research groups have sought to improve the maximum throughput of Task Scheduling systems by resorting to hardware accelerators (e.g. FPGA), leading to largely successful designs [Dallou and Juurlink, 2012, Tan, 2018, Tan et al, 2017, Yazdanpanah et al, 2015.…”

Section: Introduction (mentioning)

confidence: 99%

Adding native support for task scheduling to a Linux-capable RISC-V multicore system

Morais


Task Parallelism is a general technique for extracting parallelism of arbitrary granularity, applicable to programs from many domains with minimal impact on code readability, based on the automatic inference of data dependences between tasks. The performance of parallel applications built on this paradigm depends on how quickly the supporting Task Parallelism runtime can detect such dependences. This is even more critical for applications with fine-grained tasks, since in that scenario the scheduling overhead is not amortized by significantly longer periods of useful computation. Recently, several groups have developed FPGA-accelerated Task Parallelism support systems, which offload dependence inference operations to an FPGA accelerator to improve performance when handling fine-grained tasks. However, although these FPGA-accelerated systems show substantial gains over purely software-based alternatives, their performance is hampered by communication bottlenecks between the CPU and the FPGA, which limit their ability to handle scenarios involving a large number of cores or very fine-grained tasks. Motivated by this, we implemented a system with native support for Task Parallelism, that is, a processor with native architectural support for Task Parallelism, with the goal of considerably reducing such communication overheads. More specifically, we integrated the hardware logic of Picos, a Task Parallelism accelerator developed by the Barcelona Supercomputing Center (BSC), into Rocket Chip, an open-source multi-core RISC-V implementation developed by the University of California, Berkeley. The resulting system includes in its ISA (Instruction Set Architecture) the instructions needed for task-based applications to interact directly with this scheduling logic, minimizing the overheads associated with intermediate runtimes and eliminating all FPGA-CPU communication latency. To evaluate the performance of the resulting prototype, we both (1) adapted the Nanos task scheduling runtime so that it could be accelerated by the new task scheduling instructions, and (2) created a new lightweight task scheduling runtime that we named Phentos. Our experiments show that task-parallel programs using the Nanos-RV runtime (the version of Nanos that we ported to our system) run on average 2.13 times faster than versions of the same programs using baseline Nanos, while programs run with Phentos are on average 13.19 times faster than their counterparts based on the same baseline Nanos. These averages are geometric means over the relevant data sets. ...
