2022

DOI: 10.1109/access.2022.3203051

Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling

¹, ², ³ et al.

Abstract: Processing-in-Memory (PIM) has been actively studied to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing low-locality, data-intensive applications. In-DRAM PIMs can be categorized by how many banks perform the PIM computation in response to one DRAM command: per-bank and all-bank. The per-bank PIM operates only one bank per command, delivering low performance but preserving the standard DRAM interface and servicing non-PIM requests during PIM execution. The …

Cited by 4 publications (13 citation statements)

References 43 publications (114 reference statements)

Citation types: Supporting 0, Mentioning 13, Contrasting 0

“…Processing-in-Memory (PIM) architectures have been actively studied by placing computing units close to [9], [10], [11], [12], and [13] or inside memory [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26] to overcome the memory bandwidth limitation. PIM can maximize internal memory bandwidth for the computation using bank-level parallelism [14], [15], [17], [18], [22], [23], [24], [25], [26], thus providing high computation performance. For example, the decoupled PIM [26] achieved a speedup of 75.8x and 1.2x over CPU and GPU at the Level-3 BLAS, respectively.…”

Section: Introduction (mentioning)

confidence: 99%

“…PIM can maximize internal memory bandwidth for the computation using bank-level parallelism [14], [15], [17], [18], [22], [23], [24], [25], [26], thus providing high computation performance. For example, the decoupled PIM [26] achieved a speedup of 75.8x and 1.2x over CPU and GPU at the Level-3 BLAS, respectively. Samsung FIM [23] achieved a speedup of 11.2x and 3.5x over CPU for memory-bound neural network kernels and applications [27], [29], [32], respectively.…”

Section: Introduction (mentioning)

confidence: 99%

“…For example, the latest PIM studies from Samsung [23] and UPMEM [30] separated the PIM memory area from the non-PIM memory to avoid incompatibility with the JEDEC memory standard [31] for supporting all-bank execution. Our recent work and baseline for this research, the decoupled PIM [26] satisfies the standard memory interface. However, its performance is lower than the all-bank PIMs due to its per-bank execution.…”

Section: Introduction (mentioning)

confidence: 99%

“…It increases the hardware cost and raises the system performance issues, such as making the core very busy and incurring high latency to access uncacheable PIM areas. Recent PIMs use Direct Memory Access (DMA) as the offloading mechanism [22], [25], [26], [30], [32] to resolve the performance issue by transferring opcodes and large-size operands without CPU intervention.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, if we express the opcode and operands in a single descriptor together, we can eliminate the opcode descriptors. The elimination allowed us to reduce the total number of descriptors by 25.8%, 26.1%, and 24.9%, thus achieving significant speedups of 1.25x, 1.31x, and 1.29x compared to the baseline PIM [26] in BERT [33], RoBERTa [34], and GPT-2 [35] models, respectively, with a sequence length of 128.…”

Section: Introduction (mentioning)

confidence: 99%

PISA-DMA: Processing-in-Memory Instruction Set Architecture Using DMA

¹, ², ³ et al. 2023

Self Cite

Processing-in-memory (PIM) has attracted attention as a way to overcome the memory bandwidth limitation, especially for memory-intensive DNN applications. Most PIM approaches use the CPU's memory requests to deliver instructions and operands to the PIM engines, which keeps a core busy and incurs unnecessary data transfer, resulting in significant offloading overhead. DMA can resolve the issue by transferring a high volume of successive data without CPU intervention and without polluting the memory hierarchy, thus fitting the PIM concept well. However, the small computing resources of DRAM-based PIM devices allow only small amounts of data to be transferred per DMA transaction and require a large number of descriptors, so the offloading overhead remains significant. This paper introduces a PIM Instruction Set Architecture (ISA) using DMA descriptors, called PISA-DMA, which expresses a PIM opcode and operand in a single descriptor. Our ISA makes PIM programming intuitive: committing one PIM instruction corresponds to completing one DMA transaction, and a sequence of PIM instructions is represented by a DMA descriptor list. PISA-DMA also minimizes the offloading overhead while guaranteeing compatibility with commercial platforms. Our PISA-DMA eliminates the opcode offloading overhead and achieves 1.25x, 1.31x, and 1.29x speedups over the baseline PIM at a sequence length of 128 with the BERT, RoBERTa, and GPT-2 models, respectively, in ONNX Runtime on real machines. We also study how the proposed PISA affects compiler optimization and show that operator fusion of matrix-matrix multiplication and element-wise addition achieves a 1.04x speedup, a gain similar to that obtained with conventional ISAs.
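The idea of carrying the opcode and the operand in a single descriptor can be illustrated with a small model. The following Python sketch is not the PISA-DMA descriptor format, which this page does not specify: the `Descriptor` fields, the `PimOpcode` values, the `PIM_CTRL_ADDR` address, and both list-building functions are assumptions made only to show why folding the opcode into operand descriptors shortens the descriptor list.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional, Tuple


class PimOpcode(IntEnum):
    """Hypothetical PIM opcodes, for illustration only."""
    ADD = 1
    MUL = 2
    MAC = 3


# Hypothetical address of a PIM control register that receives opcodes
# in the baseline scheme.
PIM_CTRL_ADDR = 0xFFFF_0000


@dataclass(frozen=True)
class Descriptor:
    """Toy DMA descriptor: one transfer, optionally tagged with an opcode."""
    src: int
    dst: int
    length: int
    opcode: Optional[PimOpcode] = None


# One PIM operation: an opcode plus its (src, dst, length) operand transfers.
Op = Tuple[PimOpcode, List[Tuple[int, int, int]]]


def baseline_list(ops: List[Op]) -> List[Descriptor]:
    """Assumed baseline: a separate descriptor delivers each opcode."""
    out: List[Descriptor] = []
    for opcode, operands in ops:
        # Extra descriptor whose only job is to write the opcode.
        out.append(Descriptor(src=int(opcode), dst=PIM_CTRL_ADDR, length=4))
        out.extend(Descriptor(s, d, n) for s, d, n in operands)
    return out


def fused_list(ops: List[Op]) -> List[Descriptor]:
    """Single-descriptor scheme: each operand descriptor carries the opcode."""
    return [
        Descriptor(s, d, n, opcode)
        for opcode, operands in ops
        for s, d, n in operands
    ]


if __name__ == "__main__":
    ops = [(PimOpcode.MAC, [(0x1000, 0x8000, 64),
                            (0x2000, 0x8040, 64),
                            (0x3000, 0x8080, 64)])]
    before, after = len(baseline_list(ops)), len(fused_list(ops))
    print(f"descriptors: {before} -> {after}")
```

In this toy workload each operation has three operand transfers, so removing the opcode descriptor shrinks the list from four entries to three. The ratio is chosen purely for illustration and is not derived from the measurements reported above.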

BL-PIM: Varying the Burst Length to Realize the All-Bank Performance and Minimize the Multi-Workload Interference for in-DRAM PIM

Kim, Lee, Paik et al. 2023

No abstract

Supporting Multi-Channels to DRAM-based PIM Execution for Boosting the Performance

Kim, Kim, Kim 2024

2024 International Conference on Electronics, Information, and Communication (ICEIC)

No abstract
