2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/micro50266.2020.00040
Newton: A DRAM-maker’s Accelerator-in-Memory (AiM) Architecture for Machine Learning

Cited by 85 publications (48 citation statements)
References 35 publications
“…Hence, they act as hardware accelerators with high throughput for specific applications. Recently, the DRAM makers SK-Hynix (He et al, 2020) and Samsung (Kwon et al, 2021) introduced 16-bit floating-point processing units inside the DRAM. ePIM architectures have a high area overhead and have to reduce the size of memory arrays to accommodate the added digital logic.…”
Section: Processing In Memory
confidence: 99%
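The excerpt above describes in-DRAM FP16 processing units that act as high-throughput accelerators. As a minimal sketch only (the bank count and per-bank partitioning below are assumptions for illustration, not the actual Newton or HBM-PIM organization), the kind of bank-parallel FP16 multiply-accumulate such units perform can be emulated like this:

```python
import numpy as np

# Illustrative assumption: 16 "banks", each holding a slice of the
# matrix rows and computing its partial dot products locally, so the
# matrix operand never leaves the memory device.
NUM_BANKS = 16

def pim_gemv_fp16(matrix, vector):
    """Emulate a bank-parallel FP16 matrix-vector multiply.

    Each bank performs multiply-accumulates on its local row slice;
    only the small result vector is gathered at the end.
    """
    matrix = matrix.astype(np.float16)
    vector = vector.astype(np.float16)
    row_slices = np.array_split(matrix, NUM_BANKS, axis=0)
    partials = [np.dot(s, vector) for s in row_slices]  # per-bank MACs
    return np.concatenate(partials)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 32))
x = rng.standard_normal(32)
y = pim_gemv_fp16(A, x)
print(y.shape)  # (64,)
```

The point of the sketch is the data-movement pattern: the large operand (the matrix) stays partitioned in place, and only the vector is broadcast and the result gathered.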
“…Hence, their throughput and energy benefits show a decreasing trend for higher bit precision. To overcome this shortcoming, the architectures with custom logic (large multipliers and accumulators) (He et al, 2020), programmable computing units (Kwon et al, 2021), and LUT-based designs LAcc (Deng et al, 2019), pPIM (Sutradhar et al, 2022), pLUTo (Ferreira et al, 2021) have been proposed. These architectures embed external logic to the DRAM outside the memory array, hence, referred to as ePIM architectures.…”
Section: Prior Work On Logic Operations (Ipim) and Arithmetic Operati...
confidence: 99%
“…Thus, the reduction in total runtime comes from the reduce operation, updating embedding tables, and its PCIe transfer time. The performance of the baseline can be improved by using Processing-in-Memory (PiM) instead of the NPU as proposed in [13]. By deploying a PiM device, the latency of a forward/backward propagation in the top MLP is minimized (Fig.…”
Section: Case Study II: Recommendation System
confidence: 99%
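The excerpt refers to the forward/backward propagation of a recommendation model's "top MLP", the step whose latency a PiM device minimizes. As a hedged sketch only (a DLRM-style top MLP with arbitrary, assumed layer widths; not the cited paper's code), the forward pass being offloaded looks like:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def top_mlp_forward(features, weights, biases):
    """Chain of affine layers with ReLU; final layer yields a sigmoid score.

    `features` would be the interaction of dense inputs and embedding-table
    lookups in a recommendation model.
    """
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    logits = h @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-logits))  # click-through probability

rng = np.random.default_rng(1)
dims = [128, 64, 32, 1]  # assumed layer widths, for illustration only
Ws = [rng.standard_normal((dims[i], dims[i + 1])) * 0.1 for i in range(3)]
bs = [np.zeros(dims[i + 1]) for i in range(3)]
batch = rng.standard_normal((4, 128))
scores = top_mlp_forward(batch, Ws, bs)
print(scores.shape)  # (4, 1)
```

Each layer is a matrix multiply over a modest batch, which is exactly the memory-bandwidth-bound GEMV/GEMM work that motivates placing the computation in or near DRAM.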
“…Prior approaches to this challenge fall into one of three categories. The first avoids the challenge altogether by maintaining a copy of the data that is stored in a PIM-friendly layout and not accessed by the CPU [17], [22], [25]. This either duplicates substantial data arrays (possibly > 100GiB) [7], [32], [37] or prevents the CPU from assisting with requests that can tolerate higher response latency [15].…”
Section: Motivation and Challenges
confidence: 99%
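The first category above maintains a duplicate copy of the data in a PIM-friendly layout that the CPU never touches. A minimal sketch of what such duplication can look like (the bank count and round-robin interleaving here are illustrative assumptions, not a specific system's layout):

```python
import numpy as np

# Assumption for illustration: 8 banks, rows interleaved round-robin so
# each bank can stream its shard locally without crossing banks.
NUM_BANKS = 8

def make_pim_copy(table):
    """Duplicate `table` into per-bank shards: row i goes to bank i % NUM_BANKS."""
    return [table[b::NUM_BANKS].copy() for b in range(NUM_BANKS)]

def pim_read_row(shards, i):
    """Fetch row i from the bank-interleaved duplicate."""
    return shards[i % NUM_BANKS][i // NUM_BANKS]

table = np.arange(32 * 4).reshape(32, 4)   # CPU copy (row-major)
shards = make_pim_copy(table)              # duplicated PIM-friendly copy
print(np.array_equal(pim_read_row(shards, 13), table[13]))  # True
```

The sketch makes the cited trade-off concrete: every row now exists twice, so for the 100+ GiB embedding tables the excerpt mentions, the duplicated copy roughly doubles the capacity requirement.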