2020
DOI: 10.1109/tc.2019.2941875
Fast and Efficient Convolutional Accelerator for Edge Computing

Cited by 42 publications (36 citation statements)
References 34 publications
“…During the past few years, common trends in machine learning accelerator design have included providing higher memory bandwidth [22], efficient dataflow mapping [23], in-memory computing [24], and skipping ineffectual computations [25].…”
Section: A Design Of Efficient Hardware Accelerators
confidence: 99%
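As a rough illustration of the last trend named above, the sketch below shows what skipping ineffectual (zero-valued) computations means for a simple dot product. It is a minimal illustrative example, not code from the cited works.

```python
# Minimal sketch of zero-skipping: accumulate only products whose
# activation is non-zero, so "ineffectual" multiply-accumulates are skipped.
from typing import Sequence

def dot_skip_zeros(activations: Sequence[float], weights: Sequence[float]) -> float:
    acc = 0.0
    for a, w in zip(activations, weights):
        if a == 0.0:      # this MAC would contribute nothing
            continue      # skip it entirely
        acc += a * w
    return acc

# ReLU outputs are often sparse, so many MACs can be skipped in practice.
print(dot_skip_zeros([0.0, 1.5, 0.0, 2.0], [0.3, 0.4, 0.5, 0.6]))  # 1.8
```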
“…However, as DNNs grow (recently, models with hundreds of billions of parameters have been developed [26]), off-chip DRAM, despite its long access latency and high energy consumption, becomes indispensable. To address this, recent work incorporates a local buffer for each PE along with a global buffer shared by all PEs, enabling fast, energy-efficient data accesses, as such buffers can consume up to two orders of magnitude less energy per access than DRAM [21], [25], [27]-[35].…”
Section: A Design Of Efficient Hardware Accelerators
confidence: 99%
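A back-of-the-envelope sketch of why such buffer hierarchies pay off follows. The roughly two-orders-of-magnitude per-access gap is taken from the statement above; the specific picojoule values and reuse counts are illustrative assumptions only.

```python
# Illustrative energy accounting for on-chip reuse vs. repeated DRAM fetches.
DRAM_PJ   = 200.0   # assumed energy per DRAM access (pJ), hypothetical value
BUFFER_PJ = 2.0     # assumed energy per on-chip buffer access (pJ), ~100x less

def access_energy_pj(dram_accesses: int, buffer_accesses: int) -> float:
    return dram_accesses * DRAM_PJ + buffer_accesses * BUFFER_PJ

# Fetch 1,000 weights from DRAM once, then reuse each 100 times from the buffer,
# versus re-fetching every operand from DRAM.
with_reuse    = access_energy_pj(dram_accesses=1_000,   buffer_accesses=100_000)
without_reuse = access_energy_pj(dram_accesses=100_000, buffer_accesses=0)
print(with_reuse, without_reuse)  # 400000.0 vs 20000000.0 -> ~50x energy saving
```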
“…Among them, representative dataflow techniques are the weight-stationary (WS) [9], [10], output-stationary (OS) [11], [12], row-stationary (RS) [13], [14], and no-local-reuse (NLR) [15], [16], [17] dataflows. However, they fail to exploit the full performance potential of their architectures due to the limited data bandwidth of devices [32]. The bandwidth bottleneck prevents these architectures from supplying the parallelism their PEs require immediately after each access to off-chip memory.…”
Section: Introduction
confidence: 99%
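As a minimal sketch of one of the dataflows named above, the following illustrates a weight-stationary 1-D convolution: each weight stays fixed in a PE while the inputs stream past it, so weights are fetched from memory only once. The structure and names are illustrative assumptions, not the architecture of the cited works.

```python
# Weight-stationary (WS) dataflow sketch for a 1-D convolution.
from typing import List

def conv1d_weight_stationary(inputs: List[float], weights: List[float]) -> List[float]:
    K = len(weights)
    out_len = len(inputs) - K + 1
    outputs = [0.0] * out_len
    # Outer loop models the PEs: weight w stays "stationary" in PE k.
    for k, w in enumerate(weights):
        # Inner loop streams the inputs through the PE, accumulating partial sums.
        for i in range(out_len):
            outputs[i] += w * inputs[i + k]
    return outputs

print(conv1d_weight_stationary([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2.0, -2.0, -2.0]
```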