Jonghoon Jin scite author profile

Flattened Convolutional Neural Networks for Feedforward Acceleration

We present flattened convolutional neural networks that are designed for fast feedforward execution. The redundancy of the parameters, especially the weights of the convolutional filters in convolutional neural networks, has been extensively studied, and different heuristics have been proposed to construct a low-rank basis of the filters after training. In this work, we train flattened networks that consist of a consecutive sequence of one-dimensional filters across all directions in 3D space to obtain performance comparable to that of conventional convolutional networks. We tested the flattened model on different datasets and found that the flattened layer can effectively substitute for the 3D filters without loss of accuracy. The flattened convolution pipelines provide a speed-up of around two times during the feedforward pass compared to the baseline model, due to the significant reduction of learning parameters. Furthermore, the proposed method does not require manual tuning or post-processing once the model is trained.

INTRODUCTION

Recent success in the fast implementation of convolutional neural networks (CNNs) and new techniques such as dropout enable researchers to train large networks that were not possible before. These large CNNs show great promise in visual and audio understanding, which makes them useful for applications in autonomous robots, security systems, mobile phones, automobiles, and wearable supports. These applications require networks that not only achieve a high degree of accuracy but can also be executed in real time. However, CNNs are computationally very expensive and require high-performance servers or graphics processing units (GPUs).
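The abstract does not spell out how a sequence of 1D filters can stand in for one 3D filter. The pure-Python sketch below illustrates the idea for a single output map, assuming the three filters are applied back to back with no bias or nonlinearity between them: a lateral filter `alpha` across channels, then a vertical filter `beta`, then a horizontal filter `gamma`. Under that assumption the result equals a valid 3D cross-correlation with the rank-one filter `w[c][y][x] = alpha[c] * beta[y] * gamma[x]`. The function names are hypothetical and this is not the authors' code; the paper trains the 1D filters directly instead of factoring a trained 3D filter.

```python
def conv3d_valid(inp, w):
    """Valid 3D cross-correlation of inp[C][H][W] with a full filter w[C][Y][X].

    The channel axis is summed out, giving one output map of size
    (H - Y + 1) x (W - X + 1).
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    Y, X = len(w[0]), len(w[0][0])
    return [
        [
            sum(inp[c][i + y][j + x] * w[c][y][x]
                for c in range(C) for y in range(Y) for x in range(X))
            for j in range(W - X + 1)
        ]
        for i in range(H - Y + 1)
    ]


def flattened_conv(inp, alpha, beta, gamma):
    """Apply three 1D filters in sequence: lateral (alpha, across channels),
    vertical (beta, down each column) and horizontal (gamma, along each row)."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    Y, X = len(beta), len(gamma)
    # Lateral pass: a 1D filter over the channel axis collapses C maps into one.
    lateral = [[sum(alpha[c] * inp[c][h][w] for c in range(C))
                for w in range(W)] for h in range(H)]
    # Vertical pass: 1D filter of length Y along each column.
    vertical = [[sum(beta[y] * lateral[i + y][w] for y in range(Y))
                 for w in range(W)] for i in range(H - Y + 1)]
    # Horizontal pass: 1D filter of length X along each row.
    return [[sum(gamma[x] * vertical[i][j + x] for x in range(X))
             for j in range(W - X + 1)] for i in range(H - Y + 1)]


def param_counts(C, Y, X):
    """Weights per output map: full 3D filter vs. three 1D filters."""
    return C * Y * X, C + Y + X


if __name__ == "__main__":
    C, H, W = 3, 4, 5
    inp = [[[(c + 2 * h + 3 * w) % 5 for w in range(W)] for h in range(H)]
           for c in range(C)]
    alpha, beta, gamma = [1, -2, 3], [2, 1], [1, 0, -1]
    # The rank-one 3D filter equivalent to the three 1D filters.
    full = [[[alpha[c] * beta[y] * gamma[x] for x in range(3)]
             for y in range(2)] for c in range(C)]
    print(conv3d_valid(inp, full) == flattened_conv(inp, alpha, beta, gamma))  # True
    print(param_counts(64, 3, 3))  # (576, 70)
```

For a 64-channel 3×3 filter, the weight count per output map drops from 576 to 70. Parameter count alone does not fix runtime, so this ratio should not be read as the speed-up reported in the abstract.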

Embedded Streaming Deep Neural Networks Accelerator With Applications

Dundar, Jin, Martini et al. 2017

IEEE Trans. Neural Netw. Learning Syst.

Deep convolutional neural networks (DCNNs) have become a very powerful tool in visual perception. DCNNs have applications in autonomous robots, security systems, mobile phones, and automobiles, where high throughput of the feedforward evaluation phase and power efficiency are important. Because of this increased usage, many field-programmable gate array (FPGA)-based accelerators have been proposed. In this paper, we present an optimized streaming method for a DCNN hardware accelerator on an embedded platform. The streaming method acts as a compiler, transforming a high-level representation of DCNNs into operation codes that execute applications on a hardware accelerator. The proposed method makes maximal use of the available computational resources through a novel scheduled routing topology that combines data reuse and data concatenation. It is tested with a hardware accelerator implemented on the Xilinx Kintex-7 XC7K325T FPGA. The system fully exploits weight-level and node-level parallelization of DCNNs and achieves a peak performance of 247 G-ops while consuming less than 4 W of power. We test our system with object classification and object detection applications in real-world scenarios. Our results indicate high performance efficiency, outperforming all other presented platforms while running these applications.
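The abstract describes the streaming method as a compiler that turns a high-level network description into operation codes, but it gives no instruction set. The sketch below shows that general idea under invented assumptions: the `Layer` record, the opcodes `LOAD_WEIGHTS`, `STREAM_IN`, `STREAM_OUT` and `POOL`, and the `units` parameter are all hypothetical and do not reflect the paper's actual encoding or its scheduled routing topology.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    kind: str      # "conv" or "pool" (hypothetical layer vocabulary)
    in_maps: int   # number of input feature maps
    out_maps: int  # number of output feature maps
    kernel: int    # filter or pooling window size


def compile_network(layers, units=4):
    """Lower a layer list into (opcode, *args) tuples for a hypothetical
    accelerator with `units` parallel convolution engines."""
    program = []
    for idx, layer in enumerate(layers):
        if layer.kind == "conv":
            # Assign up to `units` output maps to the engines at once.
            for start in range(0, layer.out_maps, units):
                stop = min(start + units, layer.out_maps)
                program.append(("LOAD_WEIGHTS", idx, start, stop))
                # Each input map is streamed once and shared by every
                # engine in the group (data reuse across output maps).
                for m in range(layer.in_maps):
                    program.append(("STREAM_IN", idx, m))
                program.append(("STREAM_OUT", idx, start, stop))
        elif layer.kind == "pool":
            program.append(("POOL", idx, layer.kernel))
        else:
            raise ValueError(f"unsupported layer kind: {layer.kind!r}")
    return program


if __name__ == "__main__":
    net = [Layer("conv", 3, 8, 5), Layer("pool", 8, 8, 2)]
    for op in compile_network(net, units=4):
        print(op)
```

Grouping output maps so that one input stream feeds several engines is one simple form of the data reuse the abstract mentions; the data concatenation the paper also combines into its routing is not modeled here.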

An efficient implementation of deep convolutional neural networks on a mobile coprocessor

Jin, Gokhale, Dundar et al. 2014

Memory access optimized routing scheme for deep networks on a mobile coprocessor

Dundar, Jin, Gokhale et al. 2014

In this paper, we present a memory-access-optimized routing scheme for a hardware-accelerated, real-time implementation of deep convolutional neural networks (DCNNs) on a mobile platform. DCNNs consist of multiple layers of 3D convolutions, each comprising tens to hundreds of filters, and these convolutions are the most expensive operations in DCNNs. Systems that run DCNNs need to pass 3D input maps to the hardware accelerator for convolution, and they are limited by the bandwidth available for streaming data into and out of the accelerator. Such bandwidth-limited systems require data reuse to use computational resources efficiently. We propose a new routing scheme for 3D convolutions that takes advantage of the characteristics of DCNNs to fully utilize all the resources in the hardware accelerator. This routing scheme is implemented on the Xilinx Zynq-7000 All Programmable SoC. The system fully exploits weight-level and node-level parallelization of DCNNs and achieves a peak performance 2x better than the previous routing scheme while running DCNNs.
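Why bandwidth-limited systems need data reuse can be made concrete with a back-of-the-envelope count. The function below is a hypothetical, simplified traffic model, not the paper's routing scheme: it assumes each pass over the input streams every input map once and feeds that single stream to several output-map computations in parallel, and it ignores weight and output traffic.

```python
import math


def input_words_streamed(n_in, n_out, height, width, parallel_outputs):
    """Input words streamed into the accelerator for one convolution layer.

    Simplified model (an assumption): every pass streams all `n_in` input maps
    of size height x width once and computes `parallel_outputs` output maps
    from that single stream. Weights and output traffic are ignored.
    """
    if parallel_outputs < 1:
        raise ValueError("parallel_outputs must be at least 1")
    passes = math.ceil(n_out / parallel_outputs)
    return passes * n_in * height * width


if __name__ == "__main__":
    no_reuse = input_words_streamed(16, 32, 64, 64, parallel_outputs=1)
    reuse = input_words_streamed(16, 32, 64, 64, parallel_outputs=8)
    print(no_reuse, reuse, no_reuse // reuse)  # 2097152 262144 8
```

The layer sizes here (16 input maps, 32 output maps, 64×64 maps, 8 parallel outputs) are illustrative and not taken from the paper; the 2x figure in the abstract compares against a previous routing scheme and is not derived from this model.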

Jonghoon Jin

A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks

Flattened Convolutional Neural Networks for Feedforward Acceleration

Embedded Streaming Deep Neural Networks Accelerator With Applications

An efficient implementation of deep convolutional neural networks on a mobile coprocessor

Memory access optimized routing scheme for deep networks on a mobile coprocessor
