2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC)
DOI: 10.1109/hipc.2019.00033

SPEC2: SPECtral SParsE CNN Accelerator on FPGAs

Abstract: To accelerate inference of Convolutional Neural Networks (CNNs), various techniques have been proposed to reduce computation redundancy. Converting convolutional layers into the frequency domain significantly reduces the computational complexity of the sliding-window operations in the space domain. On the other hand, weight pruning techniques address the redundancy in model parameters by converting dense convolutional kernels into sparse ones. To obtain a high-throughput FPGA implementation, we propose SPEC2, the first w…
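The complexity reduction the abstract alludes to is the convolution theorem: a sliding-window convolution in the space domain becomes an elementwise product in the frequency domain, and pruning the spectral kernel zeroes many of those products. A minimal NumPy sketch of the equivalence (tile and kernel sizes are our own choices, not the paper's):

```python
import numpy as np

H = W = 8                          # hypothetical tile size
x = np.random.rand(H, W)           # input activation tile
k = np.zeros((H, W))
k[:3, :3] = np.random.rand(3, 3)   # 3x3 kernel, zero-padded to the tile

# Frequency domain: one elementwise product per tile. Pruning the
# spectral kernel (zeroing entries of fft2(k)) lets hardware skip
# the corresponding multiplications.
y_freq = np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)).real

# Space domain reference: explicit circular sliding-window convolution.
y_ref = np.zeros((H, W))
for i in range(H):
    for j in range(W):
        for u in range(3):
            for v in range(3):
                y_ref[i, j] += k[u, v] * x[(i - u) % H, (j - v) % W]

assert np.allclose(y_freq, y_ref)  # both computations agree
```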

Cited by 11 publications (11 citation statements) · References 22 publications
“…They keep the input activations in SRAM and stream the sparse kernels. A similar design [Niu et al 2019] streams activations with stationary weights. Both have limited reuse due to the limited BRAM (on-chip SRAM) on FPGAs.…”
Section: Inference Accelerators
confidence: 99%
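The contrast drawn in this statement is between two dataflows. A hypothetical Python sketch of the two loop orders (conv here is a stand-in for the accelerator's compute engine, not either design's actual logic):

```python
import numpy as np

def conv(act, ker):
    """Placeholder for the accelerator's convolution engine."""
    return float((act * ker).sum())

def activation_stationary(act_tile, kernel_stream):
    # Activations stay resident in BRAM; sparse kernels stream past.
    # The tile is reused once per kernel, so reuse is bounded by how
    # large a tile fits in on-chip SRAM.
    return [conv(act_tile, k) for k in kernel_stream]

def weight_stationary(kernel, act_stream):
    # Weights stay resident; activation tiles stream past instead.
    return [conv(a, kernel) for a in act_stream]

tile = np.random.rand(4, 4)
kernels = [np.random.rand(4, 4) for _ in range(3)]
print(activation_stationary(tile, kernels))
```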
“…To make a fair comparison across different platforms, we also present the DSP-efficiency and logic-efficiency on each platform. On average, our design exhibits 0.24 GOP/s/DSP DSP-efficiency, which shows 2.5X-5.7X improvement compared with prior works [16,21,48]. On the other hand, our design shows lower logic-efficiency.…”
Section: B. Performance Analysis
confidence: 68%
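DSP-efficiency as used here is sustained throughput normalized by DSP-slice usage. A one-line sketch with illustrative numbers (the DSP count is hypothetical, chosen only to reproduce the 0.24 GOP/s/DSP figure quoted above):

```python
# DSP-efficiency = sustained throughput / DSP slices consumed.
throughput_gops = 309.0   # GOP/s (quoted below for the VGG design)
dsp_slices = 1288         # hypothetical DSP usage; not from the paper
print(f"{throughput_gops / dsp_slices:.2f} GOP/s/DSP")  # -> 0.24
```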
“…The performance on the VGG network is 309.0 GOP/s, which is 3.6X-4.8X higher than [16,21]. [48] shows higher performance because they pruned the network in the frequency domain, which results in an elementwise multiplication pattern. This computation pattern shows less complexity compared with the convolution operator.…”
Section: B. Performance Analysis
confidence: 98%
“…1a). Besides, additional control logic is required to compute operations (e.g. matrix multiplication) with such formats, increasing the complexity and power consumption for embedded applications [14], [15], [16]. Therefore, we propose a framework that naturally generates structured sparsity for several levels of granularity, by fixing the number of active elements within a candidate set (comprising e.g.…”
Section: Introduction
confidence: 99%
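The framework described in this statement fixes the number of active elements in each candidate set, i.e. a top-k magnitude mask per group. A minimal sketch of that pruning rule (the group size, k, and function name are our assumptions, not the authors' API):

```python
import numpy as np

def structured_prune(weights, group=4, keep=1):
    """Keep the `keep` largest-magnitude weights in every candidate set
    of `group` consecutive elements (e.g. 1:4 sparsity); zero the rest."""
    flat = weights.reshape(-1, group)         # candidate sets
    idx = np.argsort(np.abs(flat), axis=1)    # ascending by magnitude
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, idx[:, -keep:], True, axis=1)  # top-k per set
    return (flat * mask).reshape(weights.shape)

w = np.random.randn(2, 8)
print(structured_prune(w))  # exactly 1 nonzero per 4-element candidate set
```

Because every candidate set has the same number of survivors, the resulting nonzero pattern is regular, which is what lets hardware avoid the per-element index bookkeeping the quoted passage attributes to unstructured formats.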