Efficient Hardware Design of Iterative Stencil Loops

Rana, Vincenzo; Beretta, Ivan; Bruschi, Francesco; Nacci, Alessandro Antonio; Atienza, David; Sciuto, D.

doi:10.1109/tcad.2016.2545408

Cited by 3 publications

(8 citation statements)

References 37 publications


“…In this section, we provide an example that illustrates why the DCMI acceleration strategy is superior to current state-of-the-art FPGA-based accelerators [4,40]. We first explain the operation of the stencil compute kernel in Section 2.1.…”

Section: Explaining the Efficiency of DCMI (mentioning)

confidence: 99%

“…The two main sources of parallelism in ISLs are spatial and temporal parallelism. Broadly, two main approaches have been proposed for exploiting both forms of parallelism in FPGA-based accelerators, and Streaming Time-Steps (SST) [4] and the Cone-based Architecture (CA) [40] are representative of these two strategies. In this section, we show that both SST and CA are inefficient, because they perform redundant computation and use OCM inefficiently.…”

Section: State-of-the-art ISL Acceleration Approaches: SST and CA (mentioning)

confidence: 99%

“…In this section, we show that both SST and CA are inefficient, because they perform redundant computation and use OCM inefficiently. We refer to the number of time-steps that an acceleration scheme harvests parallelism from as the iteration depth D. Figure 2 shows how SST [4] and CA [40] would accelerate a 1D three-point stencil. SST exploits temporal parallelism by instantiating an accelerator core for each time-step and connecting the output of one core to the input of the next core.…”

Section: State-of-the-art ISL Acceleration Approaches: SST and CA (mentioning)

confidence: 99%

“…First, the accelerator can exploit that inner-loop computations can be performed in parallel as long as an output element is not used as an input element within the same time-step. Intra-time-step dependencies, or spatial dependencies, can commonly be avoided by using a double-buffering strategy [40]. Second, the accelerator can exploit that the same computation is carried out in successive time-steps (i.e., several iterations of the outer loop).…”

Section: Introduction (mentioning)

confidence: 99%
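
The double-buffering strategy named in the excerpt above can be illustrated in software. The following is a minimal sketch, not taken from either paper: it assumes a 1D three-point stencil, hypothetical coefficients (0.25, 0.5, 0.25), and boundary cells that are simply held fixed. Each time-step reads only from one buffer and writes only to the other, so no output element is consumed as an input within the same time-step and the inner loop could run in parallel.

```python
def stencil_step(src, dst, coeffs=(0.25, 0.5, 0.25)):
    """One time-step of a 1D three-point stencil: read src, write dst."""
    left, centre, right = coeffs
    # Assumption for this sketch: boundary cells are copied unchanged.
    dst[0] = src[0]
    dst[-1] = src[-1]
    for i in range(1, len(src) - 1):
        # Every read comes from src, so iterations are independent.
        dst[i] = left * src[i - 1] + centre * src[i] + right * src[i + 1]


def run_double_buffered(grid, steps, coeffs=(0.25, 0.5, 0.25)):
    """Iterate the stencil, swapping the two buffers after each time-step."""
    current = list(grid)
    scratch = [0.0] * len(grid)
    for _ in range(steps):
        stencil_step(current, scratch, coeffs)
        current, scratch = scratch, current  # swap roles, no copying
    return current
```

For example, `run_double_buffered([0, 0, 4, 0, 0], 2)` spreads the central value over its neighbours across two time-steps while the two end cells stay at zero.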

“…The second architectural approach is exemplified by the Cone-based Architecture (CA) [40]. CA exploits that the input elements of a single output element form a cone that expands through prior iterations.…”

Section: Introduction (mentioning)

confidence: 99%
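
The cone of dependencies described in the excerpt above can be made concrete with a little arithmetic. This is a hedged sketch rather than the CA design itself: it assumes a 1D stencil of radius `radius` (1 for a three-point stencil) on an unbounded grid, and the function names are hypothetical.

```python
def cone_inputs(index, depth, radius=1):
    """Indices, `depth` time-steps earlier, that the output at `index` depends on."""
    return range(index - radius * depth, index + radius * depth + 1)


def cone_work(depth, radius=1):
    """Stencil evaluations needed to produce one output through `depth` time-steps."""
    # Counting back from the output, level k holds 2 * radius * k + 1 points,
    # and each of those points costs one stencil evaluation.
    return sum(2 * radius * k + 1 for k in range(depth))
```

Under these assumptions, an output at index 10 computed through 3 time-steps depends on the 7 inputs at indices 7 to 13 and needs 1 + 3 + 5 = 9 stencil evaluations. Neighbouring outputs have overlapping cones, which is one way to see where redundant computation can arise.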


DCMI. Koraei, Fatemi, Jahre. 2019. ACM Trans. Archit. Code Optim.


Iterative Stencil Loops (ISLs) are the key kernel within a range of compute-intensive applications. To accelerate ISLs with Field Programmable Gate Arrays, it is critical to exploit parallelism (1) among elements within the same iteration and (2) across loop iterations. We propose a novel ISL acceleration scheme called Direct Computation of Multiple Iterations (DCMI) that improves upon prior work by pre-computing the effective stencil coefficients after a number of iterations at design time, resulting in accelerators that use minimal on-chip memory and avoid redundant computation. This enables DCMI to improve throughput by up to 7.7× compared to the state-of-the-art cone-based architecture.

CCS Concepts: • Computer systems organization → Architectures; • Computing methodologies → Parallel computing methodologies;
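
One way to read "pre-computing the effective stencil coefficients" is sketched below. It assumes a linear 1D stencil with constant coefficients, for which several iterations collapse into a single wider stencil whose coefficients are the repeated convolution of the original ones. The function names and the coefficients (0.25, 0.5, 0.25) are illustrative assumptions, not taken from the DCMI paper, and boundary handling is ignored by computing interior points only.

```python
def convolve(a, b):
    """Full discrete convolution of two coefficient sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def effective_coeffs(coeffs, depth):
    """Coefficients of the single stencil equivalent to `depth` iterations."""
    result = [1.0]  # identity stencil: zero iterations
    for _ in range(depth):
        result = convolve(result, coeffs)
    return result


def apply_stencil(grid, coeffs):
    """Apply an odd-width stencil to the interior points of `grid` only."""
    r = len(coeffs) // 2
    return [sum(c * grid[i - r + k] for k, c in enumerate(coeffs))
            for i in range(r, len(grid) - r)]
```

With these definitions, `effective_coeffs((0.25, 0.5, 0.25), 2)` yields a five-point stencil, and applying it once to a grid gives the same interior values as applying the three-point stencil twice.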

