2014
DOI: 10.1145/2629468

A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses

Abstract: This article presents a reconfigurable hardware/software architecture for binary acceleration of embedded applications. A Reconfigurable Processing Unit (RPU) is used as a coprocessor of the General Purpose Processor (GPP) to accelerate the execution of repetitive instruction sequences called Megablocks. A toolchain detects Megablocks from instruction traces and generates customized RPU implementations. The implementation of Megablocks with memory accesses uses a memory-sharing mechanism to support concurrent …

Cited by 3 publications (9 citation statements) | References 16 publications

“…The II is determined either by the backward connections within the segment CDFGs, i.e., data/control dependencies between iterations, or by resource availability, which, if insufficient, delays execution of operations to later timesteps. When targeting row-based arrays, the most straightforward way to implement loop pipelining is to instantiate one FU per CDFG node, to connect the units according to the CDFG edges, and to generate the control bits for register and FU enables at the correct times [72]. This approach does not scale well for very large CDFGs, and we have found no row-based designs that simultaneously partition the graph and overlap execution of multiple iterations.…”
Section: Execution Model
confidence: 98%
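
To make the initiation-interval (II) bounds in this excerpt concrete, here is a minimal C sketch, assuming a hypothetical CDFG representation in which each backward (loop-carried) edge carries a latency and an iteration distance. The recurrence bound is the classic ceil(latency/distance) per dependence, and the resource bound is ceil(operations/FUs); all structure and field names and all numbers are illustrative, not taken from the cited designs.

    /* Minimal sketch (illustrative, not from the cited work): lower bounds
     * on the initiation interval (II) of a pipelined loop. */
    #include <stdio.h>

    typedef struct {
        int latency;   /* operation latency along the backward (loop-carried) edge */
        int distance;  /* number of iterations the dependence spans */
    } BackEdge;

    /* Recurrence-constrained bound: each loop-carried dependence forces
     * II >= ceil(latency / distance). */
    static int rec_mii(const BackEdge *edges, int n) {
        int ii = 1;
        for (int i = 0; i < n; i++) {
            int bound = (edges[i].latency + edges[i].distance - 1) / edges[i].distance;
            if (bound > ii) ii = bound;
        }
        return ii;
    }

    /* Resource-constrained bound: num_ops operations competing for num_fus
     * functional units force II >= ceil(num_ops / num_fus). */
    static int res_mii(int num_ops, int num_fus) {
        return (num_ops + num_fus - 1) / num_fus;
    }

    int main(void) {
        BackEdge edges[] = { {3, 1}, {4, 2} };  /* illustrative dependences */
        int rec = rec_mii(edges, 2);
        int res = res_mii(8, 8);                /* one FU per CDFG node */
        printf("II lower bound: %d\n", rec > res ? rec : res);
        return 0;
    }

With one FU per CDFG node, the resource bound is trivially 1, which is precisely why the straightforward row-based scheme pipelines easily but, as the excerpt notes, does not scale to very large CDFGs.
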
“…Additionally, multiple concurrent memory accesses by the accelerator become more difficult to implement, since there is no clear way to interface with data memories or to allow concurrent accesses. Alternatively, accelerators may be loosely coupled as peripherals [34,57,72,83], using interfaces such as buses, dedicated links, or shared memory schemes [44,56,58,71,94], as shown in Figure 6(b). Although these interfaces introduce larger overheads, they avoid intrusive modifications to the host processor.…”
Section: Accelerator-Host Interface
confidence: 99%
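
As a rough illustration of the loosely coupled, bus-attached style this excerpt describes, the sketch below drives a hypothetical memory-mapped accelerator from the host processor. The base address, register offsets, and handshake bits are assumptions for illustration, not the interface of any cited design.

    /* Minimal sketch (hypothetical register map): invoking a loosely
     * coupled accelerator attached as a bus peripheral. */
    #include <stdint.h>

    #define RPU_BASE   0x40000000u  /* assumed bus address of the accelerator */
    #define RPU_CTRL   (*(volatile uint32_t *)(RPU_BASE + 0x00))
    #define RPU_STATUS (*(volatile uint32_t *)(RPU_BASE + 0x04))
    #define RPU_ARG0   (*(volatile uint32_t *)(RPU_BASE + 0x08))
    #define RPU_RESULT (*(volatile uint32_t *)(RPU_BASE + 0x0C))

    #define CTRL_START  0x1u
    #define STATUS_DONE 0x1u

    /* Offload one loop invocation: write live-in values, start the unit,
     * spin until it signals completion, then read back the live-out value. */
    uint32_t rpu_run(uint32_t live_in) {
        RPU_ARG0 = live_in;
        RPU_CTRL = CTRL_START;
        while ((RPU_STATUS & STATUS_DONE) == 0)
            ;  /* host blocks; a real design might poll less often or sleep */
        return RPU_RESULT;
    }

The busy-wait makes the interface overhead visible: each invocation pays bus round trips for arguments, start, status polling, and results. This is the cost that loosely coupled designs accept in exchange for avoiding intrusive modifications to the host processor.
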
“…Our previous work presented a binary acceleration approach in which the execution of frequently executed loops is transparently migrated at run-time to a Reconfigurable Processing Unit (RPU), a tailored co-processor [6]–[8]. To generate an application-specific RPU, the binary of the target application is profiled by an Instruction Set Simulator (ISS) to detect Megablock instruction traces [9].…”
Section: Introduction
confidence: 99%
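
As a toy illustration of detecting repetitive instruction sequences in a trace, in the spirit of Megablock detection, the sketch below looks for a pattern of basic-block start addresses that repeats back-to-back at the tail of a trace. The representation, thresholds, and algorithm are assumptions for illustration, not the detector actually used by the toolchain.

    /* Minimal sketch (illustrative only): find a repeating pattern of
     * basic-block addresses at the tail of an instruction trace. */
    #include <stdio.h>
    #include <stddef.h>

    /* Returns the pattern length if some window of at most max_len
     * addresses repeats back-to-back at least min_reps times at the
     * end of the trace; returns 0 otherwise. */
    static size_t find_repeating_pattern(const unsigned *trace, size_t n,
                                         size_t max_len, size_t min_reps) {
        for (size_t len = 1; len <= max_len && len * min_reps <= n; len++) {
            size_t reps = 1;
            /* Compare successive windows of size len, walking backwards
             * from the end of the trace. */
            while (reps < n / len) {
                const unsigned *a = trace + n - len * reps;
                const unsigned *b = trace + n - len * (reps + 1);
                size_t i;
                for (i = 0; i < len && a[i] == b[i]; i++)
                    ;
                if (i < len) break;
                reps++;
            }
            if (reps >= min_reps) return len;
        }
        return 0;
    }

    int main(void) {
        /* Illustrative trace of basic-block addresses: a two-block loop
         * body (0x100, 0x120) executing repeatedly after a prologue. */
        unsigned trace[] = { 0x80, 0x100, 0x120, 0x100, 0x120, 0x100, 0x120 };
        size_t len = find_repeating_pattern(trace, 7, 4, 3);
        printf("repeating pattern length: %zu\n", len);  /* prints 2 */
        return 0;
    }
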