2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/micro.2018.00024
PermDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices

Cited by 85 publications (68 citation statements)
References 53 publications
“…Computation dataflow. The PE operations in prior sparse CNN implementations [6], [13], [16] are based on vector-matrix and vector-vector multiplications. However, these operations need to re-gather the sparse weights into a new vector or matrix, resulting in the overhead of matching indices between the vector and the matrix.…”
Section: Dataflow and PE Design (mentioning)
confidence: 99%
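The index-matching overhead this citation describes can be made concrete with a small sketch. The snippet below multiplies a dense activation vector by a weight matrix stored in unstructured CSR form; the per-row gather of column indices is exactly the bookkeeping that a structured format such as PermDNN's permuted diagonals avoids, since nonzero positions there are implied by the permutation. The CSR layout and all names are illustrative assumptions, not code from the cited works.

```python
# Minimal sketch (assumed, not from any cited paper) of sparse
# vector-matrix multiplication with explicit index matching.
import numpy as np

def csr_spmv(indptr, indices, values, x):
    """y = W @ x with W stored in CSR form (indptr, indices, values).

    For each output row the nonzero weights must be gathered and their
    column indices matched against entries of x -- the per-element
    index lookup identified as overhead in the citation above.
    """
    y = np.zeros(len(indptr) - 1, dtype=x.dtype)
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        # Gather this row's sparse weights and the matching inputs.
        y[row] = values[start:end] @ x[indices[start:end]]
    return y

# Tiny usage example: a 3x4 sparse matrix holding 4 nonzeros.
indptr  = np.array([0, 2, 3, 4])
indices = np.array([0, 2, 1, 3])
values  = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([1.0, 1.0, 1.0, 1.0])
print(csr_spmv(indptr, indices, values, x))  # [3. 3. 4.]
```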
“…The evolution of DNNs has already piqued interest in hardware acceleration, as both DNN training and inference demand a tremendous amount of computation. As a result, hardware accelerators such as GPUs, FPGAs, and customized ASICs have been employed to accelerate DNNs [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25]. However, DNN designers are still hampered by the growing complexity of DNN models.…”
Section: Introduction (mentioning)
confidence: 99%
“…Pruning or compressing an already-trained DNN could result in large approximation error [54], [55], [56], [57]. One alternative is to train a sparse DNN.…”
Section: Algorithmic Design (mentioning)
confidence: 99%
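The contrast drawn in this citation, pruning after training versus training sparsely from the start, can be illustrated numerically. The sketch below (an assumed illustration, not taken from references [54]-[57]) applies magnitude pruning to a random stand-in for a trained weight matrix and measures how the layer's output drifts as sparsity grows.

```python
# Minimal sketch (assumed) of the approximation error introduced by
# magnitude-pruning an already-trained layer: zeroing the
# smallest-magnitude weights perturbs the layer's outputs.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))   # stand-in for a trained weight matrix
x = rng.standard_normal(64)         # stand-in for an input activation

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(sparsity * W.size)
    threshold = np.partition(np.abs(W).ravel(), k)[k]
    return np.where(np.abs(W) >= threshold, W, 0.0)

for sparsity in (0.5, 0.8, 0.95):
    W_pruned = magnitude_prune(W, sparsity)
    rel_err = np.linalg.norm(W @ x - W_pruned @ x) / np.linalg.norm(W @ x)
    print(f"sparsity={sparsity:.2f}  relative output error={rel_err:.3f}")
```

Training a sparse (or, as in PermDNN, structurally constrained) model from scratch sidesteps this error because the optimizer fits the weights under the sparsity constraint rather than discarding information after the fact.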