2017
DOI: 10.1145/3140659.3080221

Maximizing CNN Accelerator Efficiency Through Resource Partitioning

Abstract: Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor stru…
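The inefficiency the abstract alludes to can be pictured with a toy utilization model. The sketch below is a minimal illustration, assuming made-up AlexNet-like channel counts and a simplified tiling model; it is not the paper's actual design-space search or hardware model.

```python
# Minimal sketch (not the paper's optimizer): estimate how well a single
# fixed-shape processing-element (PE) array matches conv layers of varying
# dimensions. Layer shapes below are hypothetical, AlexNet-like values.

import math

# (output_channels, input_channels) per conv layer -- illustrative only
layers = [(96, 3), (256, 48), (384, 256), (384, 192), (256, 192)]

def utilization(pe_rows, pe_cols, layer_list):
    """Fraction of PE-array cycles doing useful work when every layer in
    layer_list runs on the same (pe_rows x pe_cols) array (toy model)."""
    total_useful = total_cycles = 0
    for out_ch, in_ch in layer_list:
        # Each layer is tiled over the array; partial tiles waste PEs.
        tiles = math.ceil(out_ch / pe_rows) * math.ceil(in_ch / pe_cols)
        total_cycles += tiles * pe_rows * pe_cols
        total_useful += out_ch * in_ch
    return total_useful / total_cycles

print(f"single 64x64 processor across all layers: {utilization(64, 64, layers):.2%}")
for out_ch, in_ch in layers:
    u = utilization(64, 64, [(out_ch, in_ch)])
    print(f"  layer ({out_ch} x {in_ch}) alone on that array: {u:.2%}")
# Resource partitioning instead sizes a separate array per layer (or layer
# group), so each array's shape matches its layer and the padding waste shrinks.
```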

Cited by 131 publications (151 citation statements)
References 33 publications
“…The approach permits the optimization of the architecture for each layer but requires other techniques such as fused layers [17] to account for the extra memory required to store intermediate maps and weights. A mid-term solution was proposed by Shen et al. [18]. They noted the inefficiency of using a single module to run all convolutional layers, since some layers under-utilize the processing elements.…”
Section: Related Work (mentioning, confidence: 99%)
“…The work by Gong et al. [19] also proposes a fully pipelined FPGA accelerator for CNNs with 16-bit quantization and a layer-fusion technique. The architecture, implemented on a small ZYNQ7020 FPGA, achieves an acceptable performance of 80 GOPS, but the complexity of the process referred to by Shen et al. [18] reduces the efficiency of the solution for low-density FPGAs.…”
Section: Related Work (mentioning, confidence: 99%)
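For context on the 16-bit quantization mentioned in the excerpt above, the following is a minimal sketch of symmetric fixed-point quantization to int16; the fractional-bit count and rounding scheme are assumptions for illustration, not the exact scheme used by Gong et al. [19].

```python
# Toy symmetric int16 fixed-point quantization of CNN weights.
import numpy as np

def quantize_int16(weights, frac_bits=12):
    """Map float weights to int16 with a fixed number of fractional bits."""
    scale = 1 << frac_bits
    q = np.clip(np.round(weights * scale), -32768, 32767).astype(np.int16)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) / scale

w = np.random.randn(3, 3).astype(np.float32) * 0.5
q, scale = quantize_int16(w)
print("max abs quantization error:", np.max(np.abs(w - dequantize(q, scale))))
```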
“…However, the fixed dimensions of a single computing unit cannot match layers of differing dimensions, which leads to resource inefficiency, especially in Fully Connected (FCN) layers [19]. Some recent works [20][21][22][23][24] focus on a parallel streaming architecture, which partitions a system into several independent tasks and runs them on parallel hardware [25]. In general, the partitioning occurs at the task level and at the data level.…”
Section: Introduction (mentioning, confidence: 99%)
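A rough way to picture the task-level versus data-level distinction drawn in this excerpt: the toy sketch below uses plain Python lists as stand-ins for hardware streams; the layer names, image stream, and shard counts are purely illustrative, not taken from the cited works.

```python
# Toy illustration of two partitioning granularities for a streaming CNN.
layers = ["conv1", "conv2", "conv3", "fc1"]      # the CNN as a task graph
images = [f"img{i}" for i in range(8)]           # a stream of inputs

# Task-level partitioning: each layer becomes its own pipeline stage, so
# different images occupy different stages at the same time.
pipeline_stages = [{"stage": layer, "queue": list(images)} for layer in layers]

# Data-level partitioning: one layer's input is split across parallel
# workers, each processing a slice of the batch / feature map.
num_workers = 4
data_shards = [images[w::num_workers] for w in range(num_workers)]

print("task-level stages:", [s["stage"] for s in pipeline_stages])
print("data-level shards:", data_shards)
```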
“…This is the second mapping approach. Shen et al. [22] and Venieris et al. [23] both present a resource partitioning methodology for mapping CNNs onto FPGAs. It can be regarded as a trade-off between the "one size fits all" and "one to one" approaches.…”
Section: Introduction (mentioning, confidence: 99%)
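The trade-off between "one size fits all" and "one to one" can be sketched as choosing how to split a sequence of layers into a small number of groups, each served by one processor. The enumeration below is a hypothetical illustration using a toy utilization model similar to the earlier sketch; it is not the optimization formulated in [22] or [23].

```python
# Toy middle-ground mapping: partition the conv layers into a few contiguous
# groups, one processor per group, and pick the split with the best
# worst-group utilization (an assumed, illustrative objective).
from itertools import combinations
import math

layer_channels = [(96, 3), (256, 48), (384, 256), (384, 192), (256, 192)]

def group_utilization(group):
    """Utilization if one processor is sized for the largest layer in the group."""
    rows = max(o for o, _ in group)
    cols = max(i for _, i in group)
    useful = sum(o * i for o, i in group)
    cycles = sum(math.ceil(o / rows) * math.ceil(i / cols) * rows * cols
                 for o, i in group)
    return useful / cycles

def best_partition(layer_list, num_procs):
    """Try every split of the layer sequence into num_procs contiguous groups."""
    best = None
    for cuts in combinations(range(1, len(layer_list)), num_procs - 1):
        bounds = [0, *cuts, len(layer_list)]
        groups = [layer_list[a:b] for a, b in zip(bounds, bounds[1:])]
        score = min(group_utilization(g) for g in groups)
        if best is None or score > best[0]:
            best = (score, groups)
    return best

score, groups = best_partition(layer_channels, num_procs=2)
print(f"worst-group utilization {score:.2%} with groups {groups}")
```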
“…As a result, more and more processing power is available to run complex models, such as CNNs, in a reasonable time frame. Furthermore, researchers are working to improve the efficiency of CNN models (Ioannou et al., 2016; Shen et al., 2016; Zhang et al., 2016a).…”
Section: Discussion (mentioning, confidence: 99%)