2019
DOI: 10.1109/tc.2018.2876312
A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

Abstract: Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: …
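The abstract's emphasis on the computational patterns of gradient-based training comes down to streaming multiply-accumulate loops over large arrays. As a point of reference only (not code from the paper), a minimal C sketch of the stochastic gradient descent update that dominates such workloads; the function name and signature are illustrative:

#include <stddef.h>

/* Minimal SGD weight update: w <- w - eta * dL/dw.
 * This streaming multiply-accumulate over large arrays is the kind of
 * loop near-memory engines like NTX target; the name and signature
 * here are illustrative, not taken from the paper. */
void sgd_step(float *w, const float *grad, size_t n, float lr)
{
    for (size_t i = 0; i < n; i++)
        w[i] -= lr * grad[i];
}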

Cited by 57 publications (46 citation statements)
References 29 publications
“…Each lane consists of a First-In First-Out (FIFO) queue to buffer read and write data. An address generator based on the one presented by Schuiki et al [8] and Conti et al [9] assigns memory addresses to the stream-based accesses performed by the core. The lane can be put into read mode, in which case the address generator is used to fetch data from memory and store it in the FIFO.…”
Section: Data Mover
confidence: 99%
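The quoted passage describes a lane as a FIFO paired with an address generator. Below is a minimal C sketch of that structure, assuming a single-stride address generator and a fixed-depth FIFO (the cited generators [8], [9] support richer, nested-loop access patterns; all names here, such as lane_t and agen_next, are hypothetical):

#include <stdint.h>
#include <stdbool.h>

#define FIFO_DEPTH 8

/* Hypothetical strided address generator. */
typedef struct {
    uint64_t base;   /* next address of the stream       */
    uint64_t stride; /* byte increment between accesses  */
    uint64_t count;  /* remaining elements in the stream */
} agen_t;

bool agen_next(agen_t *g, uint64_t *addr)
{
    if (g->count == 0)
        return false;        /* stream exhausted */
    *addr = g->base;
    g->base += g->stride;
    g->count--;
    return true;
}

/* One lane: a FIFO plus an address generator. */
typedef struct {
    uint32_t fifo[FIFO_DEPTH];
    unsigned head, tail, fill;
    agen_t   agen;
    bool     read_mode;      /* true: fetch memory into the FIFO */
} lane_t;

/* Refill step in read mode: fetch one word from memory into the FIFO. */
void lane_read_step(lane_t *l, const uint32_t *mem)
{
    uint64_t addr;
    if (l->read_mode && l->fill < FIFO_DEPTH && agen_next(&l->agen, &addr)) {
        l->fifo[l->tail] = mem[addr / sizeof(uint32_t)];
        l->tail = (l->tail + 1) % FIFO_DEPTH;
        l->fill++;
    }
}

/* Core-side pop: returns false while no data is buffered. */
bool lane_pop(lane_t *l, uint32_t *data)
{
    if (l->fill == 0)
        return false;
    *data = l->fifo[l->head];
    l->head = (l->head + 1) % FIFO_DEPTH;
    l->fill--;
    return true;
}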
“…In the following we provide a short overview of the NTX architecture outlined in more detail in [12]. The Logic Base (LoB) of an HMC offers a unique opportunity to introduce a PiM as depicted in Figure 1.…”
Section: Architecture
confidence: 99%
“…We combine a small 32-bit RISC-V processor core (RV32IMC) [18] with multiple NTX co-processors. Both operate on a shared 64 kB TCDM (reduced from 128 kB in [12]). The memory is divided into 32 banks that are connected to the processors via an interconnect offering single-cycle access latency.…”
Section: A Processing Cluster
confidence: 99%
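The quoted cluster shares a 64 kB TCDM split into 32 banks so the core and the NTX co-processors can access memory in parallel with few conflicts. A small C sketch of word-level interleaving across banks follows; the actual mapping in [12] is not specified in the quote, so the constants and functions below are assumptions for illustration:

#include <stdint.h>
#include <stdio.h>

#define TCDM_BYTES (64 * 1024)  /* 64 kB TCDM, per the quote        */
#define NUM_BANKS  32           /* 32 banks, per the quote          */
#define WORD_BYTES 4            /* 32-bit words: 512 words per bank */

/* Hypothetical word-level interleaving: consecutive words land in
 * consecutive banks, so streaming accesses spread over all banks. */
unsigned bank_of(uint32_t addr)        { return (addr / WORD_BYTES) % NUM_BANKS; }
unsigned offset_in_bank(uint32_t addr) { return (addr / WORD_BYTES) / NUM_BANKS; }

int main(void)
{
    /* Consecutive words rotate through the banks. */
    for (uint32_t addr = 0; addr < 8 * WORD_BYTES; addr += WORD_BYTES)
        printf("addr 0x%04x -> bank %2u, row %u\n",
               (unsigned)addr, bank_of(addr), offset_in_bank(addr));
    return 0;
}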
“…Improvements to accelerator efficiency [27,85], DNN-optimized GPU kernels [23,28], and libraries designed to efficiently leverage instruction set extensions [23,83] have improved the computational efficiency of DNN evaluation. However, improving the memory efficiency of DNN evaluation is an ongoing challenge [13,29,37,54,107,148]. The memory intensity of DNN inference is increasing, and the sizes of state-of-the-art DNNs have grown dramatically in recent years.…”
Section: Introduction
confidence: 99%