Proceedings of the 26th Asia and South Pacific Design Automation Conference 2021
DOI: 10.1145/3394885.3431554

Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators

Cited by 26 publications (14 citation statements) · References 1 publication

Citation statements:
“…Pruning [19, 71, 78, 79, 132, 134, 140, 171, 200, 265, 288]; Quantization [19, 68, 90, 134, 166, 179, 291, 307, 311, 314]; Knowledge Distillation [29, 41, 42, 80, 83, 88, 95, 170, 186, 195, 220, 228, 231, 239, 257, 266, 267, 274, 295, 296, 300, 312]; Low-rank factorization [76, 98, 119, 168, 190, 196, 210, 292]; Conditional Computation…”
Section: Model Compression (mentioning confidence: 99%)
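Of the compression families listed in this excerpt, quantization is the one the cited paper targets. A minimal post-training uniform-quantization sketch in Python (function names and bit-widths are illustrative, not taken from any of the referenced works):

```python
import numpy as np

def uniform_quantize(x, num_bits=8):
    """Symmetric uniform quantization of a float tensor to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8 bits
    scale = max(np.abs(x).max() / qmax, 1e-12)  # map the largest magnitude to qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale                             # integer codes plus the scale factor

def dequantize(q, scale):
    """Recover an approximate float tensor from the integer codes."""
    return q.astype(np.float32) * scale

# Quantize a random weight matrix and check the worst-case rounding error.
w = np.random.randn(64, 64).astype(np.float32)
q, s = uniform_quantize(w, num_bits=8)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```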
“…With the proposed dynamic programming assisted quantisation approach, the authors demonstrated a 16× compression of a ResNet-18 model with less than a 3% accuracy drop. The authors in [90] proposed a quantisation scheme for DNN inference that targets the weights, the inputs to the model, and the partial sums occurring inside the hardware accelerator. Experiments showed that the proposed scheme reduced inference latency and energy consumption by up to 3.89× and 4.84×, respectively, with a 1.18% loss in DNN inference accuracy.…”
Section: Model Compression (mentioning confidence: 99%)
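The scheme described in [90] quantises weights, model inputs, and the partial sums produced inside the accelerator. A minimal Python sketch of that idea, assuming a symmetric uniform quantiser and crossbar-sized column tiles (all names and bit-widths here are hypothetical, not the authors' implementation):

```python
import numpy as np

def quantize(x, num_bits):
    """Symmetric uniform quantizer; returns a fake-quantized (dequantized) copy."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax if np.any(x) else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def tiled_mvm(weights, inputs, tile=128, w_bits=4, in_bits=8, psum_bits=8):
    """Matrix-vector product with quantized weights, inputs, and partial sums.

    Each `tile`-column slice models one crossbar; its partial sum is
    quantized (as a low-precision ADC readout would be) before the
    digital accumulation across tiles.
    """
    w_q = quantize(weights, w_bits)
    x_q = quantize(inputs, in_bits)
    acc = np.zeros(weights.shape[0])
    for start in range(0, weights.shape[1], tile):
        psum = w_q[:, start:start + tile] @ x_q[start:start + tile]
        acc += quantize(psum, psum_bits)   # quantized partial sum
    return acc

w = np.random.randn(256, 512)
x = np.random.randn(512)
print("mean abs error vs FP:", np.abs(tiled_mvm(w, x) - w @ x).mean())
```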
“…Several works propose hardware accelerators for basecalling [63, 77, 78] or read mapping [54, 56-58, 62, 65-68, 71, 79-83]. Among these accelerators, non-volatile memory (NVM)-based processing-in-memory (PIM) accelerators offer high performance and efficiency, since NVM-based PIM provides in-situ and highly parallel computation support for matrix-vector multiplications (MVM) [101-111] and string matching operations [112-130].…”
Section: State-of-the-art Solutions (mentioning confidence: 99%)
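To illustrate the in-situ MVM support this excerpt refers to, here is a simplified Python model of a ReRAM crossbar (idealized physics, hypothetical names; not the design of any cited accelerator): each column current is the analog dot product of the row voltages with the column conductances, and multi-bit weights are split across low-precision cells and recombined with digital shift-and-add.

```python
import numpy as np

def crossbar_mvm(g_matrix, v_in):
    """Idealized crossbar: each column current is the dot product of the
    input voltages with that column's conductances (I_j = sum_i V_i * G_ij)."""
    return v_in @ g_matrix

def pim_mvm(weights, x, cell_bits=2):
    """Split signed integer weights into `cell_bits`-per-cell magnitude slices,
    run each slice through the analog crossbar model, and recombine the
    per-slice results digitally with shift-and-add."""
    w = weights.astype(np.int64)
    sign = np.where(w < 0, -1, 1)
    mag = np.abs(w)
    result = np.zeros(weights.shape[1], dtype=np.int64)
    shift = 0
    while np.any(mag):
        slice_ = (mag % (1 << cell_bits)) * sign          # low bits of each weight
        result += crossbar_mvm(slice_, x) << shift        # shift-and-add recombine
        mag >>= cell_bits
        shift += cell_bits
    return result

W = np.random.randint(-8, 8, size=(16, 4))   # 4-bit signed weights
x = np.random.randint(0, 4, size=16)         # small integer inputs
assert np.array_equal(pim_mvm(W, x), x @ W)  # matches the exact product
print(pim_mvm(W, x))
```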
“…Quantization in hardware saves memory, reduces data movement, and lowers the latency of arithmetic operations [2]. Supporting full-precision computation and complex arithmetic for large neural networks on low-power hardware is challenging.…”
Section: Introduction (mentioning confidence: 99%)
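As a back-of-the-envelope check on the memory-saving argument in that excerpt, a short Python snippet (the 25M-parameter model size is chosen arbitrarily for illustration):

```python
# Rough storage footprint of a 25M-parameter model at different precisions.
params = 25_000_000
for bits in (32, 16, 8, 4):
    mb = params * bits / 8 / 1e6          # bits -> bytes -> megabytes
    print(f"{bits:>2}-bit weights: {mb:7.1f} MB")
# 8-bit weights take 4x less space than FP32, and proportionally less
# data has to move between memory and the compute units.
```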