2021
DOI: 10.1109/tcad.2020.3023903

OMNI: A Framework for Integrating Hardware and Software Optimizations for Sparse CNNs

Abstract: Convolutional neural networks (CNNs), one of today's main flavors of deep learning techniques, dominate various image recognition tasks. As the model size of modern CNNs continues to grow, neural network compression techniques have been proposed to prune redundant neurons and synapses. However, prior techniques disconnect software neural network compression from hardware acceleration, and therefore fail to balance multiple design parameters, including sparsity, performance, hardware area cost, and efficiency. …
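To make the "prune redundant neurons and synapses" step concrete, here is a minimal sketch of magnitude-based weight pruning. It is an illustrative assumption, not the paper's method; the sparsity target and tensor shape are placeholders.

import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    # Zero out the smallest-magnitude weights so that roughly `sparsity` of them become zero.
    k = int(weights.size * sparsity)          # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights), k - 1, axis=None)[k - 1]
    mask = np.abs(weights) > threshold        # keep only the larger-magnitude synapses
    return weights * mask

# Illustrative use: prune a bank of 16 3x3 filters over 8 input channels to ~70% sparsity.
w = np.random.randn(16, 8, 3, 3).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.7)
print("sparsity:", 1.0 - np.count_nonzero(w_sparse) / w_sparse.size)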

Cited by 27 publications (15 citation statements) | References 45 publications

Citation statements (ordered by relevance):
“…Sparse tensor accelerators. [13,25,26,44,50,51,89,92] are sparse DNN accelerators. MAERI [41] uses tree-based interconnects for data distribution and reduction which is similar to our reconfigurable adder tree.…”
Section: Related Work (mentioning)
confidence: 99%
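The comparison to a reconfigurable adder tree can be read as a log-depth pairwise reduction of partial sums. The sketch below is purely illustrative (not code from MAERI or from this paper); real designs do this with hardware adders and configurable interconnects.

def adder_tree_reduce(partial_sums):
    # Reduce a list of partial sums with a binary adder tree: pairs are summed
    # level by level, so N inputs take about log2(N) levels.
    level = list(partial_sums)
    while len(level) > 1:
        if len(level) % 2:                    # pad odd-sized levels with a zero operand
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0

print(adder_tree_reduce([1, 2, 3, 4, 5]))     # -> 15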
“…For example, the latency of MRAM tends to be substantially larger than that of SRAM [58]. Moreover, the bandwidth of local memory also varies between memory blocks, depending on the number of banks allocated to those blocks (e.g., [19] and [59,60]). Thus, each PE tends to experience an order-of-magnitude difference in its latency and bandwidth, depending on which memory block the activations (or filters) are transferred from/to.…”
Section: Spatial Data Dependence Graph (mentioning)
confidence: 99%
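As a first-order illustration of that point (the latency and bandwidth numbers below are assumptions made for the sketch, not values from the cited works), the transfer time a PE sees can be modeled as a fixed access latency plus the tile size divided by the block's bandwidth:

# Assumed, illustrative parameters per memory block: (access latency in ns, bandwidth in GB/s).
MEMORY_BLOCKS = {
    "sram_bank":  (2.0, 64.0),
    "mram_block": (30.0, 8.0),
}

def transfer_time_ns(block, size_bytes):
    # First-order model: latency + size / bandwidth (1 GB/s is roughly 1 byte/ns).
    latency_ns, bw_bytes_per_ns = MEMORY_BLOCKS[block]
    return latency_ns + size_bytes / bw_bytes_per_ns

tile = 4 * 1024   # a 4 KiB activation tile
for block in MEMORY_BLOCKS:
    print(block, round(transfer_time_ns(block, tile), 1), "ns")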
“…Most of these works optimize their dataflows based on loop operations like loop interchange and loop unrolling [16-20]. The dense accelerator can result in high hardware inefficiency since most multiplication operations involve zero operands [5,6,16,21-25]. Implementation of sparse DNNs has been studied in recent years on FPGAs [26].…”
Section: Introduction (mentioning)
confidence: 99%
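To illustrate the zero-operand point in that quote (a toy example assumed for illustration, not the paper's accelerator design), a MAC loop over a pruned weight vector can simply skip multiplications whose weight is zero:

def dense_dot(weights, activations):
    # Dense MAC loop: every weight is multiplied, even when it is zero.
    return sum(w * a for w, a in zip(weights, activations))

def sparse_dot(weights, activations):
    # Sparse MAC loop: pruned (zero) weights contribute nothing, so skip them.
    return sum(w * a for w, a in zip(weights, activations) if w != 0)

w = [0.0, 0.5, 0.0, 0.0, -1.2, 0.0]   # a pruned filter row, mostly zeros
x = [3.0, 2.0, 7.0, 1.0, 4.0, 9.0]
assert dense_dot(w, x) == sparse_dot(w, x)   # same result, far fewer useful multiplies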