2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)
DOI: 10.23919/date.2018.8342188

Block convolution: Towards memory-efficient inference of large-scale CNNs on FPGA

Abstract: Deep convolutional neural networks have achieved remarkable progress in recent years. However, the large volume of intermediate results generated during inference poses a significant challenge to accelerator design for resource-constrained FPGAs. Due to the limited on-chip storage, partial results of intermediate layers are frequently transferred back and forth between on-chip memory and off-chip DRAM, leading to a non-negligible increase in latency and energy consumption. In this paper, we propose block conv…


Cited by 22 publications (30 citation statements)
References 19 publications
“…Since the majority of the computations in a network are matrix-matrix/matrix-vector multiplications, it is critical to handle the massive nested loops efficiently to achieve high throughput. Loop optimization is one of the most frequently adopted techniques in accelerator design [92,56,73,2,88,49], including loop tiling, loop unrolling, loop interchange, etc. Loop tiling divides all of the data into multiple small blocks in order to relieve the pressure on on-chip storage [56,2,64], while loop unrolling attempts to improve the parallelism of the computing engine for high speed [56,64].…”
Section: Optimizing For High Throughput
confidence: 99%
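The tiling-plus-unrolling structure this excerpt describes can be made concrete with a small sketch. The C fragment below is illustrative only: the dimensions, tile sizes, and function name are assumptions, not taken from any of the cited designs, and the output array is assumed to be zero-initialized by the caller. Tiling restructures the nested convolution loops so that each tile's working set fits in on-chip buffers; unrolling then parallelizes the inner tile loops.

```c
/* A minimal sketch of loop tiling and loop unrolling for one convolutional
 * layer. All dimensions, tile sizes, and names are illustrative assumptions;
 * out[] is assumed zero-initialized by the caller. */
#define N  64   /* output channels (assumed)          */
#define M  64   /* input channels (assumed)           */
#define R  56   /* output feature-map rows (assumed)  */
#define C  56   /* output feature-map cols (assumed)  */
#define K  3    /* kernel size (assumed)              */
#define Tn 8    /* tile sizes, chosen to divide N, M, R, C evenly (assumed) */
#define Tm 8
#define Tr 14
#define Tc 14

void conv_tiled(float out[N][R][C],
                const float in[M][R + K - 1][C + K - 1],
                const float w[N][M][K][K])
{
    /* Loop tiling: the outer loops walk over tiles, so each tile's
     * working set is small enough to stay in on-chip buffers. */
    for (int n0 = 0; n0 < N; n0 += Tn)
    for (int m0 = 0; m0 < M; m0 += Tm)
    for (int r0 = 0; r0 < R; r0 += Tr)
    for (int c0 = 0; c0 < C; c0 += Tc)
        /* Loop unrolling: on an FPGA the n/m loops below would be fully
         * unrolled into Tn x Tm parallel multiply-accumulate units. */
        for (int n = n0; n < n0 + Tn; n++)
        for (int m = m0; m < m0 + Tm; m++)
        for (int r = r0; r < r0 + Tr; r++)
        for (int c = c0; c < c0 + Tc; c++)
        for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            out[n][r][c] += w[n][m][i][j] * in[m][r + i][c + j];
}
```

Loop interchange, the third technique the excerpt names, then amounts to reordering these loops to match the on-chip buffer layout and the desired data-reuse pattern.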
“…[70] designed a flexible data buffering scheme to reduce bandwidth requirements, and [2] and [88] proposed fusion-based methods to reduce off-chip traffic. Most recently, [49] presented a block-based convolution that can completely avoid off-chip transfers of intermediate data in VGG-16 with high throughput.…”
Section: Optimizing For Low Energy Consumption
confidence: 99%
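To make the block-based idea attributed to [49] easier to picture, here is a minimal single-channel sketch. The dimensions, names, and the zero treatment of block borders are assumptions for illustration; the paper's exact boundary handling may differ. The key point is that each spatial block is convolved independently, so its inputs and partial results never have to leave on-chip memory.

```c
/* Minimal single-channel sketch of block convolution. FH/FW/BS/KS and
 * the zero treatment at block borders are assumptions for illustration. */
#define FH 56          /* feature-map height (assumed) */
#define FW 56          /* feature-map width (assumed)  */
#define BS 14          /* block size (assumed)         */
#define KS 3           /* kernel size (assumed)        */
#define PD (KS / 2)    /* padding radius               */

void block_conv(float out[FH][FW], const float in[FH][FW],
                const float ker[KS][KS])
{
    for (int bh = 0; bh < FH; bh += BS)
    for (int bw = 0; bw < FW; bw += BS) {
        /* Each block is processed independently: pixels outside the
         * current block are treated as zero (local padding), so no
         * cross-block data, and hence no off-chip transfer of
         * intermediate results, is needed. */
        for (int r = bh; r < bh + BS; r++)
        for (int c = bw; c < bw + BS; c++) {
            float acc = 0.0f;
            for (int i = -PD; i <= PD; i++)
            for (int j = -PD; j <= PD; j++) {
                int rr = r + i, cc = c + j;
                if (rr >= bh && rr < bh + BS && cc >= bw && cc < bw + BS)
                    acc += ker[i + PD][j + PD] * in[rr][cc];
            }
            out[r][c] = acc;
        }
    }
}
```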
“…However, as discussed in [33], in state-of-the-art deep CNNs, CONVs consume most of the computation time, making them one of the most critical tasks limiting the achievable speed. For this reason, the design of hardware parallel convolutional engines suited to the inference of deep CNNs in high-performance, low-power applications has recently received a great deal of attention [26][27][28][29][30][31]34]. The most widely exploited design techniques aim to boost achievable performance by increasing the level of parallelism with which data is processed [28][29][30][31]34].…”
Section: Background and Motivations
confidence: 99%
“…For this reason, the design of hardware parallel convolutional engines suited to the inference of deep CNNs in high-performance, low-power applications has recently received a great deal of attention [26][27][28][29][30][31]34]. The most widely exploited design techniques aim to boost achievable performance by increasing the level of parallelism with which data is processed [28][29][30][31]34]. Indeed, as is visible in Figure 1a, most of the computations involved in a convolutional layer are independent of each other, offering the possibility of parallelizing the operations within the kernel and across both ifmaps and ofmaps.…”
Section: Background and Motivations
confidence: 99%
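The within-kernel independence noted in this excerpt can be shown directly: for a 3x3 kernel, the nine products contributing to one output pixel have no dependences on one another, so a hardware engine can compute them with nine parallel multipliers feeding an adder tree. The sketch below is illustrative only; the function name and argument layout are assumptions.

```c
/* Minimal sketch of intra-kernel parallelism for a 3x3 window. The
 * function name and argument layout are illustrative assumptions. */
static inline float mac3x3(const float win[3][3], const float ker[3][3])
{
    /* All nine products are independent (no loop-carried dependence);
     * in hardware they can map to nine multipliers firing in the
     * same cycle. */
    float p0 = win[0][0] * ker[0][0], p1 = win[0][1] * ker[0][1],
          p2 = win[0][2] * ker[0][2], p3 = win[1][0] * ker[1][0],
          p4 = win[1][1] * ker[1][1], p5 = win[1][2] * ker[1][2],
          p6 = win[2][0] * ker[2][0], p7 = win[2][1] * ker[2][1],
          p8 = win[2][2] * ker[2][2];
    /* Adder tree: four levels of pairwise additions instead of a
     * sequential accumulation chain. */
    return (((p0 + p1) + (p2 + p3)) + ((p4 + p5) + (p6 + p7))) + p8;
}
```

Replicating such a unit across multiple input and output feature maps is the across-ifmap/across-ofmap parallelism the excerpt refers to.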