TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer

Zhou, Minxuan; Xu, Weihong; Kang, Jaeyoung; Rosing, Tajana

doi:10.1109/hpca53966.2022.00082

Cited by 28 publications

(12 citation statements)

References 34 publications


“…We align the memory specifications of TransPIM, such as HBM timing parameters and capacity, with those used for NeuPIMs and the NPU+PIM baseline. Figure 15 reports the speedup of NeuPIMs over TransPIM [89]. NeuPIMs shows an average 228× higher throughput than TransPIM. The significant performance gap is attributed to the effectiveness of GEMM computation executed on the NPU in the case of NeuPIMs, as opposed to PIM in TransPIM.…”

Section: Results (mentioning)

confidence: 99%


NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

Heo, Lee, Cho et al. 2024

Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems


Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators like TPUs or NPUs are proficient in handling GEMM but are less efficient for GEMV computations. Conversely, Processing-in-Memory (PIM) technology is tailored for efficient GEMV computation, while it lacks the computational power to handle GEMM effectively. Inspired by this insight, we propose NeuPIMs, a heterogeneous acceleration system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in efficiently integrating NPU and PIM lies in enabling concurrent operations on both platforms, each addressing a specific kernel type. First, existing PIMs typically operate in a "blocked" mode, allowing only either NPU or PIM to be active at any given time. Second, the inherent dependencies between GEMM and GEMV in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs is equipped with dual row buffers in each bank, facilitating the simultaneous management of memory read/write operations and PIM commands. Further, NeuPIMs employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to allow two independent sub-batches to be pipelined within a single NeuPIMs device. Our evaluation demonstrates that compared to GPU-only, NPU-only, and naïve NPU+PIM integrated acceleration approaches, NeuPIMs achieves 3×, 2.4×, and 1.6× throughput improvement, respectively.
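The GEMM/GEMV contrast drawn in this abstract can be illustrated with a rough arithmetic-intensity estimate (FLOPs per byte moved). The sketch below is only an illustration under stated assumptions: the function name, the model dimensions, the 2-byte element size, and the "every operand crosses the memory interface once" traffic model are all hypothetical choices made here, not figures from NeuPIMs or TransPIM.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) product.

    Assumption: both operands and the result each cross the memory
    interface exactly once (no cache reuse is modeled).
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical sizes chosen for illustration only.
HIDDEN = 4096    # model hidden dimension
BATCH = 64       # requests processed together
SEQ_LEN = 2048   # cached tokens per request
HEAD_DIM = 128   # per-head dimension

# Batched projection (QKV generation / feed-forward): one weight matrix is
# shared by the whole batch, so the product is a GEMM.
gemm_intensity = arithmetic_intensity(BATCH, HIDDEN, HIDDEN)

# Attention score for one request and one head: a single query vector is
# multiplied against that request's own cached keys, so the product is a GEMV.
gemv_intensity = arithmetic_intensity(1, HEAD_DIM, SEQ_LEN)

print(f"GEMM: {gemm_intensity:.1f} FLOPs/byte")
print(f"GEMV: {gemv_intensity:.2f} FLOPs/byte")
```

Under these assumptions the GEMM performs tens of FLOPs per byte while the GEMV performs about one, which matches the abstract's description of the former as compute-intensive and the latter as bandwidth-heavy.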


“…PIM for language model support. TransPIM [89] is a PIM solution that accelerates the end-to-end transformer inference using PIM. The work proposes a data loading overhead reduction technique by customizing its dataflow for transformer models.…”

Section: Discussion (mentioning)

confidence: 99%

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

Heo, Lee, Cho et al. 2024

Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems


“…While Rowclone [40] proposes bulk data copy of a row data across different banks, significant data movement induced in data analytics are flexible that Rowclone cannot be effectively utilized. TransPIM [51] and GearBox [31] propose specific network-on-chip (NoC) for efficient DRAM internal data movement for target applications. However, their NoCs consume a large area overhead considering the DRAM area constraint.…”

Section: Challenge Of Data Analytics a Internal Data Movement Overhea... (mentioning)

confidence: 99%

“…PIM and NMP Newton, HBM-PIM, TransPIM, McDRAM, Ambit, and SIMDRAM [15], [16], [30], [41], [43], [51] support the regular or non-condition-oriented workloads to avoid data dependent dataflow by accelerating memory-bound vector operations exploiting internal parallelism of DRAM. [19], [25], [38] accelerates a recommendation system where the gather-and-scatter operations are the main target.…”

Section: Related Work (mentioning)

confidence: 99%

Genome Sequence of Brevibacillus brevis HK544, an Antimicrobial Bacterium Isolated from Soil in Daejeon, South Korea

Kim, Han et al. 2021

Microbiol Resour Announc


The Brevibacillus brevis HK544 strain, which was isolated from soil, exhibited antimicrobial activity against plant pathogens such as Botrytis cinerea, Phytophthora infestans, and Erwinia amylovora. Here, we report the draft genome sequence of the B. brevis HK544 strain, which consists of one circular chromosome of 6,486,246 bp with a GC content of 47.3%.


“…When the size of intermediate data surpasses the allocated memory block size, frameworks need to require more memory. To mitigate the problem, memory pool techniques, optimized by profiling inference process or DL model, can be employed for efficient memory management [90,98]. For example, the memory access pattern can be saved during model conversion and the inference framework can allocate all memory directly during the setup stage.…”

Section: Implications and Suggestions (mentioning)

confidence: 99%

Linguistic Semiotics

Wang 2020

Peking University Linguistics Research


Vision-Language models (VLMs) that use contrastive language-image pre-training have shown promising zero-shot classification performance. However, their performance on imbalanced dataset is relatively poor, where the distribution of classes in the training dataset is skewed, leading to poor performance in predicting minority classes. For instance, CLIP achieved only 5% accuracy on the iNaturalist18 dataset. We propose to add a lightweight decoder to VLMs to avoid OOM (out of memory) problem caused by large number of classes and capture nuanced features for tail classes. Then, we explore improvements of VLMs using prompt tuning, fine-tuning, and incorporating imbalanced algorithms such as Focal Loss, Balanced SoftMax and Distribution Alignment. Experiments demonstrate that the performance of VLMs can be further boosted when used with decoder and imbalanced methods. Specifically, our improved VLMs significantly outperforms zero-shot classification by an average accuracy of 6.58%, 69.82%, and 6.17%, on ImageNet-LT, iNaturalist18, and Places-LT, respectively. We further analyze the influence of pre-training data size, backbones, and training cost. Our study highlights the significance of imbalanced learning algorithms in face of VLMs pre-trained by huge data. We release our code at https://github.com/Imbalance-VLM/Imbalance-VLM.

