Origami: A 803-GOp/s/W Convolutional Network Accelerator

Cavigelli, Lukas; Benini, Luca

doi:10.1109/tcsvt.2016.2592330

Cited by 178 publications

(164 citation statements)

References 42 publications

Supporting

Mentioning: 164

Contrasting

“…Numerous previous efforts [15][16][17][18][19][20][21][22][23][24][25][26] have proposed solutions for CNN acceleration, but it is difficult to compare their performance directly due to differences in implementation and design choices. In this section, we present a taxonomy of these existing CNN dataflows based on their data handling characteristics.…”

Section: Existing CNN Dataflows (mentioning)

confidence: 99%

“…Many previous papers have proposed specialized CNN dataflows on various platforms, including GPU [14], FPGA [15][16][17][18][19][20][21], and ASIC [22][23][24][25][26]. However, due to differences in technology, hardware resources and system setup, a direct comparison between different implementations does not provide much insight into the relative energy efficiency of different dataflows.…”

Section: Introduction (mentioning)

confidence: 99%

“…Second, direct inter-PE communication can be used very effectively for (1) passing partial sums to achieve spatially distributed accumulation, or (2) sharing the same input data for parallel computation without incurring higher energy data transfers. ASIC implementations usually deploy dozens to hundreds of PEs and specialize the PE datapath only for CNN computation [22][23][24][25][26]. FPGAs are also used to build CNN accelerators, and these designs usually use integrated DSP slices to construct the PE datapaths [15][16][17][18][19][20][21].…”

Section: Introduction (mentioning)

confidence: 99%

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks

Chen, Emer, Sze. 2016. 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Abstract: Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy. In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.

“…The CPU is responsible for only receiving and sending the packets. Cavigelli et al (2015) presented a convolutional network accelerator that is scalable to network sizes that are currently handled by only workstation GPUs, but remains within the power envelope of embedded systems. It can significantly improve the external memory bottleneck of previous architectures, is more area efficient than previously reported results, and comes with the lowest-ever reported power consumption when including I/O power and external memory.…”

Section: Hardware Acceleration (mentioning)

confidence: 99%

Real-time pre-processing system with hardware accelerator for mobile core networks

Cheng. 2017. Frontiers Inf Technol Electronic Eng

With the rapidly increasing number of mobile devices being used as essential terminals or platforms for communication, security threats now target the whole telecommunication infrastructure and become increasingly serious. Network probing tools, which are deployed as a bypass device at a mobile core network gateway, can collect and analyze all the traffic for security detection. However, due to the ever-increasing link speed, it is of vital importance to offload the processing pressure of the detection system. In this paper, we design and evaluate a real-time pre-processing system, which includes a hardware accelerator and a multi-core processor. The implemented prototype can quickly restore each encapsulated packet and effectively distribute traffic to multiple back-end detection systems. We demonstrate the prototype in a well-deployed network environment with large volumes of real data. Experimental results show that our system can achieve at least 18 Gb/s with no packet loss with all kinds of communication protocols.

“…It can process a matrix multiplication at very high speed. In addition to GPUs, FPGAs [11,12,13] and specific LSIs [14,15,16] have been proposed. By utilizing a specialized hardware structure for CNN, it can achieve higher throughput and operation performance, compared to GPU based approaches.…”

Section: Introduction (mentioning)

confidence: 99%

An energy-efficient coarse grained spatial architecture for convolutional neural networks AlexNet

Zhao, Wang, Liu. 2017. IEICE Electron. Express

In this paper, we propose a CGSA (Coarse Grained Spatial Architecture) which processes different kinds of convolution with high performance and low energy consumption. The architecture's 16 coarse grained parallel processing units achieve a peak 152 GOPS running at 500 MHz by exploiting local data reuse of image data, feature map data and filter weights. It achieves 99 frames/s on the convolutional layers of the AlexNet benchmark, consuming 264 mW working at 500 MHz and 1 V. We evaluated the architecture by comparing some recent CNN's accelerators. The evaluation result shows that the proposed architecture achieves 3× energy efficiency and 3.5× area efficiency than existing work of the similar architecture and technology proposed by Chen.
