Deep convolutional neural networks (CNNs) are the deep learning model of choice for object detection, classification, semantic segmentation, and natural language processing tasks. CNNs require billions of operations to process a frame. This computational complexity, combined with the inherent parallelism of the convolution operation, makes CNNs an excellent target for custom accelerators. However, when optimizing for different CNN hierarchies and data access patterns, it is difficult for custom accelerators to achieve close to 100% computational efficiency. In this work, we present Snowflake, a scalable and efficient accelerator that is agnostic to CNN workloads and is designed to always perform at near-peak hardware utilization. Snowflake achieves a computational efficiency of over 91% on modern CNN models. Snowflake, implemented on a Xilinx Zynq XC7Z045 SoC, is capable of a peak throughput of 128 G-ops/s and measured throughputs of 100 frames per second (120 G-ops/s) on the AlexNet CNN model, 36 frames per second (116 G-ops/s) on the GoogLeNet CNN model, and 17 frames per second (122 G-ops/s) on the ResNet-50 CNN model. To the best of our knowledge, Snowflake is the only implemented system capable of achieving over 91% efficiency on modern CNNs and the only implemented system with GoogLeNet and ResNet as part of the benchmark suite.
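As a rough sanity check on the figures quoted above, dividing the measured throughput by the frame rate recovers the implied per-frame workload of each network. The short Python sketch below does exactly that; the per-frame numbers are derived purely from the abstract's figures and are illustrative, not taken from the paper itself.

```python
# Back-of-the-envelope check of the reported throughput figures:
# measured G-ops/s divided by frames/s gives the implied work per frame.

PEAK_GOPS = 128.0  # peak throughput of the Zynq XC7Z045 implementation

measurements = {
    # model:      (frames per second, measured G-ops/s)
    "AlexNet":    (100, 120),
    "GoogLeNet":  (36, 116),
    "ResNet-50":  (17, 122),
}

for model, (fps, gops) in measurements.items():
    per_frame = gops / fps  # implied G-ops of work per frame
    print(f"{model:10s} {per_frame:5.2f} G-ops/frame "
          f"({gops} G-ops/s at {fps} fps, peak {PEAK_GOPS} G-ops/s)")
```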
In this paper, we present a memory-access-optimized routing scheme for a hardware-accelerated real-time implementation of deep convolutional neural networks (DCNNs) on a mobile platform. DCNNs consist of multiple layers of 3D convolutions, each comprising between tens and hundreds of filters, and these convolutions account for the most expensive operations in DCNNs. Systems that run DCNNs need to pass 3D input maps to the hardware accelerators for convolution, and they face the limitation of streaming data into and out of the hardware accelerator. Such bandwidth-limited systems require data reuse to utilize computational resources efficiently. We propose a new routing scheme for 3D convolutions that takes advantage of the characteristics of DCNNs to fully utilize all the resources in the hardware accelerator. This routing scheme is implemented on the Xilinx Zynq-7000 All Programmable SoC. The system fully exploits weight-level and node-level parallelization of DCNNs and achieves a peak performance 2x better than the previous routing scheme while running DCNNs.
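To illustrate the data-reuse opportunity the abstract refers to, the sketch below spells out a plain 3D-convolution loop nest in Python/NumPy. The comments mark the loops that an accelerator can parallelize across filters (reading "weight-level parallelism" as parallelism across filter weights and output channels is an assumption here, not the paper's definition) and the point where an input patch is reused by many filters, which is what makes on-chip reuse pay off on a bandwidth-limited system.

```python
import numpy as np

def conv_layer(inputs, weights):
    """Naive convolution layer: inputs (C, H, W), weights (M, C, K, K).

    Every input pixel is read by all M filters at up to K*K positions,
    so reusing input data on-chip instead of re-streaming it is what
    keeps a bandwidth-limited accelerator's compute units busy.
    """
    C, H, W = inputs.shape
    M, _, K, _ = weights.shape
    out_h, out_w = H - K + 1, W - K + 1
    outputs = np.zeros((M, out_h, out_w))

    for m in range(M):              # filters are independent -> parallel units
        for y in range(out_h):
            for x in range(out_w):  # output pixels are also independent
                # The (C, K, K) input patch below is identical for all M
                # filters at this (y, x) position: fetching it once and
                # broadcasting it to M multiply-accumulate units is the
                # kind of reuse a routing scheme can exploit.
                patch = inputs[:, y:y + K, x:x + K]
                outputs[m, y, x] = np.sum(patch * weights[m])
    return outputs

# Tiny example: 3 input maps of 8x8 pixels, 4 filters of 3x3.
out = conv_layer(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3))
print(out.shape)  # (4, 6, 6)
```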