2014
DOI: 10.1145/2629468

A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses

Abstract: This article presents a reconfigurable hardware/software architecture for binary acceleration of embedded applications. A Reconfigurable Processing Unit (RPU) is used as a coprocessor of the General Purpose Processor (GPP) to accelerate the execution of repetitive instruction sequences called Megablocks. A toolchain detects Megablocks from instruction traces and generates customized RPU implementations. The implementation of Megablocks with memory accesses uses a memory-sharing mechanism to support concurrent …

Cited by 3 publications (9 citation statements) | References 16 publications

“…The II is determined either by the backward connections within the segment CDFGs, i.e., data/control dependencies between iterations, or by resource availability, which, if insufficient, delays execution of operations to later timesteps. When targeting row-based arrays, the most straightforward way to implement loop pipelining is to instantiate one FU per CDFG node, to connect the units according to the CDFG edges, and to generate the control bits for register and FU enables at the correct times [72]. This approach does not scale well for very large CDFGs, and we have found no row-based designs that simultaneously partition the graph and overlap execution of multiple iterations.…”
Section: Execution Model
confidence: 98%
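
To make the initiation-interval (II) bounds in this excerpt concrete, here is a minimal C sketch, assuming a hypothetical CDFG representation in which each backward (loop-carried) edge carries a latency and an iteration distance. The recurrence bound is the classic ceil(latency/distance) per dependence, and the resource bound is ceil(operations/FUs); all structure and field names and all numbers are illustrative, not taken from the cited designs.

    /* Minimal sketch (illustrative, not from the cited work): lower bounds
     * on the initiation interval (II) of a pipelined loop. */
    #include <stdio.h>

    typedef struct {
        int latency;   /* operation latency along the backward (loop-carried) edge */
        int distance;  /* number of iterations the dependence spans */
    } BackEdge;

    /* Recurrence-constrained bound: each loop-carried dependence forces
     * II >= ceil(latency / distance). */
    static int rec_mii(const BackEdge *edges, int n) {
        int ii = 1;
        for (int i = 0; i < n; i++) {
            int bound = (edges[i].latency + edges[i].distance - 1) / edges[i].distance;
            if (bound > ii) ii = bound;
        }
        return ii;
    }

    /* Resource-constrained bound: num_ops operations competing for num_fus
     * functional units force II >= ceil(num_ops / num_fus). */
    static int res_mii(int num_ops, int num_fus) {
        return (num_ops + num_fus - 1) / num_fus;
    }

    int main(void) {
        BackEdge edges[] = { {3, 1}, {4, 2} };  /* illustrative dependences */
        int rec = rec_mii(edges, 2);
        int res = res_mii(8, 8);                /* one FU per CDFG node */
        printf("II lower bound: %d\n", rec > res ? rec : res);
        return 0;
    }

With one FU per CDFG node, the resource bound is trivially 1, which is precisely why the straightforward row-based scheme pipelines easily but, as the excerpt notes, does not scale to very large CDFGs.
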
“…Additionally, multiple concurrent memory accesses by the accelerator become more difficult to implement, since there is no clear way to interface with data memories or to allow concurrent accesses. Alternatively, accelerators may be loosely coupled as peripherals [34,57,72,83], using interfaces such as buses, dedicated links, or shared memory schemes [44,56,58,71,94], as shown in Figure 6(b). Although these interfaces introduce larger overheads, they avoid intrusive modifications to the host processor.…”
Section: Accelerator-Host Interface
confidence: 99%
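
As a rough illustration of the loosely coupled, bus-attached style this excerpt describes, the sketch below drives a hypothetical memory-mapped accelerator from the host processor. The base address, register offsets, and handshake bits are assumptions for illustration, not the interface of any cited design.

    /* Minimal sketch (hypothetical register map): invoking a loosely
     * coupled accelerator attached as a bus peripheral. */
    #include <stdint.h>

    #define RPU_BASE   0x40000000u  /* assumed bus address of the accelerator */
    #define RPU_CTRL   (*(volatile uint32_t *)(RPU_BASE + 0x00))
    #define RPU_STATUS (*(volatile uint32_t *)(RPU_BASE + 0x04))
    #define RPU_ARG0   (*(volatile uint32_t *)(RPU_BASE + 0x08))
    #define RPU_RESULT (*(volatile uint32_t *)(RPU_BASE + 0x0C))

    #define CTRL_START  0x1u
    #define STATUS_DONE 0x1u

    /* Offload one loop invocation: write live-in values, start the unit,
     * spin until it signals completion, then read back the live-out value. */
    uint32_t rpu_run(uint32_t live_in) {
        RPU_ARG0 = live_in;
        RPU_CTRL = CTRL_START;
        while ((RPU_STATUS & STATUS_DONE) == 0)
            ;  /* host blocks; a real design might poll less often or sleep */
        return RPU_RESULT;
    }

The busy-wait makes the interface overhead visible: each invocation pays bus round trips for arguments, start, status polling, and results. This is the cost that loosely coupled designs accept in exchange for avoiding intrusive modifications to the host processor.
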
“…Our previous work presented a binary acceleration approach in which the execution of frequently executed loops is transparently migrated at run-time to a Reconfigurable Processing Unit (RPU), a tailored co-processor [6]–[8]. To generate an application-specific RPU, the binary of the target application is profiled by an Instruction Set Simulator (ISS) to detect Megablock instruction traces [9].…”
Section: Introduction
confidence: 99%
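
As a toy illustration of detecting repetitive instruction sequences in a trace, in the spirit of Megablock detection, the sketch below looks for a pattern of basic-block start addresses that repeats back-to-back at the tail of a trace. The representation, thresholds, and algorithm are assumptions for illustration, not the detector actually used by the toolchain.

    /* Minimal sketch (illustrative only): find a repeating pattern of
     * basic-block addresses at the tail of an instruction trace. */
    #include <stdio.h>
    #include <stddef.h>

    /* Returns the pattern length if some window of at most max_len
     * addresses repeats back-to-back at least min_reps times at the
     * end of the trace; returns 0 otherwise. */
    static size_t find_repeating_pattern(const unsigned *trace, size_t n,
                                         size_t max_len, size_t min_reps) {
        for (size_t len = 1; len <= max_len && len * min_reps <= n; len++) {
            size_t reps = 1;
            /* Compare successive windows of size len, walking backwards
             * from the end of the trace. */
            while (reps < n / len) {
                const unsigned *a = trace + n - len * reps;
                const unsigned *b = trace + n - len * (reps + 1);
                size_t i;
                for (i = 0; i < len && a[i] == b[i]; i++)
                    ;
                if (i < len) break;
                reps++;
            }
            if (reps >= min_reps) return len;
        }
        return 0;
    }

    int main(void) {
        /* Illustrative trace of basic-block addresses: a two-block loop
         * body (0x100, 0x120) executing repeatedly after a prologue. */
        unsigned trace[] = { 0x80, 0x100, 0x120, 0x100, 0x120, 0x100, 0x120 };
        size_t len = find_repeating_pattern(trace, 7, 4, 3);
        printf("repeating pattern length: %zu\n", len);  /* prints 2 */
        return 0;
    }
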