Using Runahead Execution to Hide Memory Latency in High Level Synthesis

Fleming, Shane T.; Thomas, David B.

doi:10.1109/fccm.2017.33

Cited by 5 publications

(5 citation statements)

References 12 publications

“…The generated Boogie program only describes the program behaviour of the partitioned memory for the memory arbitration problem. EASY uses the pre-existing slicing tool by Fleming and Thomas [31] to automatically extract the memory behaviour from the input code. The sliced code is a list of instructions that affects the partitioned memory access, disregarding all other irrelevant instructions in the thread function.…”

Section: Generating a Boogie Program (mentioning)

confidence: 99%

Efficient Memory Arbitration in High-Level Synthesis From Multi-Threaded Code

Cheng, Fleming, Chen, et al. 2022. IEEE Trans. Comput.

Self Cite


High-level synthesis (HLS) is an increasingly popular method for generating hardware from a description written in a software language like C/C++. Traditionally, HLS tools have operated on sequential code; however, in recent years there has been a drive to synthesise multi-threaded code. In this context, a major challenge facing HLS tools is how to automatically partition memory among parallel threads to fully exploit the bandwidth available on an FPGA device and minimise memory contention. Existing partitioning approaches require inefficient arbitration circuitry to serialise accesses to each bank because they make conservative assumptions about which threads might access which memory banks. In this article, we design a static analysis that can prove certain memory banks are only accessed by certain threads, and use this analysis to simplify or even remove the arbiters while preserving correctness. We show how this analysis can be implemented using the Microsoft Boogie verifier on top of a satisfiability modulo theories (SMT) solver, and propose a tool named EASY using automatic formal verification. Our work supports arbitrary input code with any irregular memory access patterns and indirect array addressing forms. We implement our approach in LLVM and integrate it into the LegUp HLS tool. For a set of typical application benchmarks, our results show that EASY can achieve 0.13× (avg. 0.43×) of area and 1.64× (avg. 1.28×) of performance compared to the baseline, with little additional compilation time relative to the long time in hardware synthesis.

“…RELISH (Runahead Execution of Load Instructions via Sliced Hardware) [36] is a LegUp HLS optimization pass which constructs a "pslice" (precomputation slice) for an accelerator. A "pslice" is an executable portion of an original program which only includes certain operations, in this case every long latency global load in the accelerated function.…”

Section: Taxonomy Of Existing Projects (mentioning)

confidence: 99%
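
To make the "pslice" idea in the statement above concrete, here is a minimal sketch of a backward slice that keeps only the load instructions and the computations feeding their addresses. It is an illustration under stated assumptions, not the RELISH or LegUp implementation: the `Instr` record, the toy instruction list, and the `precomputation_slice` function are all hypothetical, each value is assumed to be defined exactly once (SSA-like), and control flow and memory dependences are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical toy instruction; not an LLVM or LegUp data structure."""
    dest: str        # name of the value this instruction defines ("" if none)
    op: str          # e.g. "const", "add", "mul", "load", "store"
    operands: tuple  # names of the values this instruction reads


def precomputation_slice(program):
    """Keep every load plus everything its address transitively depends on.

    Assumes each name is defined at most once and ignores control flow.
    """
    defs = {ins.dest: ins for ins in program if ins.dest}
    keep = set()
    worklist = [ins for ins in program if ins.op == "load"]
    while worklist:
        ins = worklist.pop()
        if ins in keep:
            continue
        keep.add(ins)
        # Follow operands back to their defining instructions.
        for name in ins.operands:
            if name in defs:
                worklist.append(defs[name])
    # Preserve the original program order in the slice.
    return [ins for ins in program if ins in keep]


# Toy function body: out[0] = (mem[base + i] ** 2) + i
program = [
    Instr("i", "const", ()),
    Instr("base", "const", ()),
    Instr("addr", "add", ("base", "i")),
    Instr("x", "load", ("addr",)),
    Instr("y", "mul", ("x", "x")),
    Instr("sum", "add", ("y", "i")),
    Instr("", "store", ("sum", "out")),
]

sliced = precomputation_slice(program)
for ins in sliced:
    print(ins.dest, "=", ins.op, ins.operands)
```

In this toy example the slice retains the address computation and the load, and drops the multiply, the final add, and the store, since none of them is needed to issue the load early.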

A Survey on Domain-Specific Memory Architectures

Soldavini, Pilato. 2021. JICS


The never-ending demand for high performance and energy efficiency is pushing designers towards an increasing level of heterogeneity and specialization in modern computing systems. In such systems, creating efficient memory architectures is one of the major opportunities for optimizing modern workloads (e.g., computer vision, machine learning, graph analytics, etc.) that are extremely data-driven. However, designers demand proper design methods to tackle the increasing design complexity and address several new challenges, like the security and privacy of the data to be elaborated. This paper overviews the current trend for the design of domain-specific memory architectures. Domain-specific architectures are tailored for the given application domain, with the introduction of hardware accelerators and custom memory modules while maintaining a certain level of flexibility. We describe the major components, the common challenges, and the state-of-the-art design methodologies for building domain-specific memory architectures. We also discuss the most relevant research projects, providing a classification based on our main topics.


“…Unrolling a loop increases the number of operations that use the same memory, turning memories into bottlenecks since the compiler does not infer more memories or more ports to the existing ones. To avoid this problem, memory buffers [34,35], partition [36], or run-ahead [37] techniques can be applied.…”

Section: Scalability (mentioning)

confidence: 99%