Aquabolt-XL: Samsung HBM2-PIM with in-memory processing for ML accelerators and beyond

Kim, Jin Hyun; Kang, Shin-haeng; Lee, Sukhan; Kim, Hyeonsu; Song, Woongjae; Ro, Yuhwan; Lee, Seungwon; Wang, David; Shin, Hyunsung; Phuah, Bengseng; Choi, Joo-Ho; So, Jinin; Cho, Yeongon; Song, Joonho; Choi, Jangseok; Cho, Jeonghyeon; Sohn, Kiwon; Sohn, Young-Soo; Park, Kwang-Il; Kim, Nam Sung

doi:10.1109/hcs52781.2021.9567191

“…There are also variety of prior works that leverage PIM for GEMV operations [25,40,44,48,49,68,83] due to their inherent potential in benefits towards bandwidth bound applications. However, none of these works enable simultaneous execution of PIM and NPU operations, necessary for the efficient execution of LLM inference.…”

Section: Discussion (mentioning)

confidence: 99%

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

Heo, Lee, Cho et al. 2024

Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems


Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators like TPUs or NPUs are proficient in handling GEMM but are less efficient for GEMV computations. Conversely, Processing-in-Memory (PIM) technology is tailored for efficient GEMV computation, but it lacks the computational power to handle GEMM effectively. Inspired by this insight, we propose NeuPIMs, a heterogeneous acceleration system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in efficiently integrating NPU and PIM lies in enabling concurrent operations on both platforms, each addressing a specific kernel type. First, existing PIMs typically operate in a "blocked" mode, allowing only either NPU or PIM to be active at any given time. Second, the inherent dependencies between GEMM and GEMV in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs is equipped with dual row buffers in each bank, facilitating the simultaneous management of memory read/write operations and PIM commands. Further, NeuPIMs employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to allow two independent sub-batches to be pipelined within a single NeuPIMs device. Our evaluation demonstrates that, compared to GPU-only, NPU-only, and naïve NPU+PIM integrated acceleration approaches, NeuPIMs achieves 3×, 2.4× and 1.6× throughput improvement, respectively.
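The GEMM versus GEMV contrast in the abstract above can be illustrated with a rough arithmetic-intensity estimate. This is only a sketch and is not taken from the NeuPIMs paper: the helper name `arithmetic_intensity`, the 4096-wide dimensions, the batch size of 256, and the 2-byte (fp16) element size are all assumptions, as is the simplification that each operand and the result cross the memory bus exactly once.

```python
def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) product.

    Assumption: both operands and the result are each moved
    to or from memory exactly once (no cache reuse modeled).
    """
    flops = 2 * m * k * n                       # one multiply + one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical sizes, chosen only for illustration.
HIDDEN = 4096
BATCH = 256

# GEMV: a single vector (m = 1) against a HIDDEN x HIDDEN matrix.
gemv = arithmetic_intensity(1, HIDDEN, HIDDEN)

# GEMM: BATCH vectors share the same HIDDEN x HIDDEN matrix.
gemm = arithmetic_intensity(BATCH, HIDDEN, HIDDEN)

print(f"GEMV: {gemv:.2f} FLOPs/byte")
print(f"GEMM: {gemm:.2f} FLOPs/byte")
```

Under these assumptions the GEMV stays below one FLOP per byte, so it is limited by memory bandwidth, while the GEMM reuses each matrix element across the batch and reaches a few hundred FLOPs per byte, which is the regime an NPU's compute arrays are built for.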


“…NPUs: Examples of NPUs include Google's Tensor Processing Unit (TPU) [20], tensor cores in NVIDIA A100 Ampere architecture, Samsung NPU [21], Sambanova's RDU [22], IBM's AI Accelerator [23], Microsoft Brainwave [24], Tesla's Self-Driving computer [25], Facebook's ML accelerator [26], etc. NPU architectures can be standalone, a co-processor, or a near-data processing engine [27]- [29]. Most NPUs are spatial architectures (e.g., Fig.…”

Section: A. NPU Design Requirements and Challenges (mentioning)

confidence: 99%

Special Session: Towards an Agile Design Methodology for Efficient, Reliable, and Secure ML Systems

Dave, Marchisio, Hanif et al. 2022

2022 IEEE 40th VLSI Test Symposium (VTS)


The real-world use cases of Machine Learning (ML) have exploded over the past few years. However, the current computing infrastructure is insufficient to support all real-world applications and scenarios. Apart from high efficiency requirements, modern ML systems are expected to be highly reliable against hardware failures as well as secure against adversarial and IP stealing attacks. Privacy concerns are also becoming a first-order issue. This article summarizes the main challenges in agile development of efficient, reliable and secure ML systems, and then presents an outline of an agile design methodology to generate efficient, reliable and secure ML systems based on user-defined constraints and objectives.


“…NPUs: Examples of NPUs include Google's Tensor Processing Unit (TPU) [20], tensor cores in NVIDIA A100 Ampere architecture, Samsung NPU [21], Sambanova's RDU [22], IBM's AI Accelerator [23], Microsoft Brainwave [24], Tesla's Self-Driving computer [25], Facebook's ML accelerator [26], etc. NPU architectures can be standalone, a co-processor, or a near-data processing engine [27]- [29]. Most NPUs are spatial architectures (e.g., Fig.…”

Section: A. NPU Design Requirements and Challenges (mentioning)

confidence: 99%

Special Session: Towards an Agile Design Methodology for Efficient, Reliable, and Secure ML Systems

Dave, Marchisio, Hanif et al. 2022

Preprint



Aquabolt-XL: Samsung HBM2-PIM with in-memory processing for ML accelerators and beyond

Cited by 24 publications

References 0 publications
