2020
DOI: 10.1109/access.2020.3011265

McDRAM v2: In-Dynamic Random Access Memory Systolic Array Accelerator to Address the Large Model Problem in Deep Neural Networks on the Edge

Abstract: The energy efficiency of accelerating deep neural networks (DNNs) hundreds of megabytes in size in a mobile environment is lower than that of a server-class big-chip accelerator because of the limited power budget, silicon area, and smaller static random access memory buffer sizes of mobile systems. To address this challenge and provide powerful computing capability for processing large DNN models in power/resource-limited mobile systems, we propose McDRAM v2, a novel in-dynamic random access memory…
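The abstract centers on an in-DRAM systolic array for large-model inference. As a rough illustration of the compute pattern such an array accelerates, the following is a minimal NumPy sketch of a weight-stationary systolic matrix multiply; the grid size, function names, and loop structure are our own assumptions for exposition, not the paper's design.

```python
# Minimal sketch (not from the paper): the weight-stationary systolic-array
# dataflow that accelerators like McDRAM v2 implement in hardware, emulated
# sequentially in plain Python. Array shape and naming are illustrative only.
import numpy as np

def systolic_matmul(weights: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """Emulate an R x C grid of multiply-accumulate (MAC) cells.

    Each cell holds one weight ("weight-stationary"); activations stream in
    from one edge and partial sums cascade through the grid, one hop per cycle.
    """
    rows, cols = weights.shape          # grid of MAC processing elements
    n_vecs = activations.shape[1]       # number of streamed activation vectors
    out = np.zeros((cols, n_vecs))
    # In hardware the work below happens in parallel, skewed in time;
    # here we only reproduce the arithmetic the array performs.
    for v in range(n_vecs):
        partial = np.zeros(cols)
        for r in range(rows):           # partial sums accumulate down the rows
            partial += weights[r, :] * activations[r, v]
        out[:, v] = partial
    return out

W = np.random.rand(4, 4)                # 4x4 PE grid, chosen for illustration
A = np.random.rand(4, 8)                # 8 streamed activation vectors
assert np.allclose(systolic_matmul(W, A), W.T @ A)
```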

Cited by 26 publications (16 citation statements) | References 47 publications
“…A recent work [145] presents a real-world PIM system with programmable near-bank computation units, called FIMDRAM, based on HBM technology [113,153]. The FIMDRAM architecture, designed specifically for machine learning applications, implements a SIMD pipeline with simple multiply-and-accumulate units [44,226]. Compared to the more general-purpose UPMEM PIM architecture, FIMDRAM is focused on a specific application domain (i.e., machine learning), and thus it may lack the flexibility to support a wider range of applications.…”
Section: Related Work
confidence: 99%
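The statement above attributes to FIMDRAM a near-bank SIMD pipeline built from simple multiply-and-accumulate units. Here is a minimal sketch of what one lane-parallel MAC step computes, assuming a hypothetical 16-lane FP16 datapath; the lane width and every name below are illustrative assumptions, not Samsung's specification.

```python
# Minimal sketch (assumption, not FIMDRAM's actual datapath): a lane-parallel
# multiply-and-accumulate step, with NumPy vectors standing in for SIMD lanes.
import numpy as np

LANES = 16  # hypothetical SIMD width chosen for illustration

def simd_mac(acc: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One pipeline step: acc[i] += a[i] * b[i] across all lanes at once."""
    assert acc.shape == a.shape == b.shape == (LANES,)
    return acc + a * b

# Dot product of two long vectors, consumed LANES elements per step.
x = np.random.rand(64).astype(np.float16)
y = np.random.rand(64).astype(np.float16)
acc = np.zeros(LANES, dtype=np.float16)
for i in range(0, x.size, LANES):
    acc = simd_mac(acc, x[i:i + LANES], y[i:i + LANES])
print(float(acc.sum()))  # approximately np.dot(x, y), up to FP16 rounding
```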
“…In addition, as TPU v4 [28] reuses the hardware designs of TPU v3 except for several components such as on-chip memory capacity, on-chip interconnect, and DMA, the VU of TPU v4 has the same structure as that of TPU v3. There have been processing-near-DRAM studies [10,14,31] that provide high off-chip memory bandwidth during inference. Because [10,14] use dataflow architectures such as Eyeriss v1 [7] and the systolic array, they still do not process DW-CONV efficiently. In contrast, [31] has advantages for memory-intensive operations but weaknesses for compute-intensive ST-CONV operations.…”
Section: Related Work
confidence: 99%
“…Note that DRAM has also been used for CIM [74]; however, that work targets high-performance server applications, which goes beyond the scope of this work.…”
Section: In-Memory Computing
confidence: 99%