Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays
DOI: 10.1145/3543622.3573210
CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture

Cited by 10 publications (8 citation statements). References 22 publications.
“…This happens, e.g., when the FP32 output from Softmax needs to be used as the input of the next matrix-multiply layer. We deploy the same INT8 quantized model, DeiT-T, on the AMD ACAP architecture [30] VCK190 [27] board using CHARM [19]. CHARM [19] is the state-of-the-art deep learning inference accelerator and mapping framework on ACAP architecture, which features FPGA, AIE vector processors, and CPU on the system-on-chip.…”
Section: Design Challenges and Proposed Solution (citation type: mentioning, confidence: 99%)
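The precision boundary this statement describes can be sketched concretely: an INT8 matrix-multiply engine consumes and produces quantized tensors, while Softmax runs in FP32, so the FP32 Softmax output must be re-quantized before it can feed the next INT8 matrix-multiply layer. A minimal NumPy sketch, assuming symmetric per-tensor INT8 quantization; the scales and helper names are illustrative, not CHARM's API:

```python
import numpy as np

def quantize_int8(x_fp32, scale):
    # Symmetric per-tensor quantization: fp32 -> int8 (illustrative scheme).
    return np.clip(np.round(x_fp32 / scale), -128, 127).astype(np.int8)

def softmax_fp32(x):
    # Numerically stable Softmax, computed in FP32.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative scales; real ones would come from calibration of the model.
s_in, s_w, s_act = 0.05, 0.02, 0.01

a = quantize_int8(np.random.randn(8, 16), s_in)   # int8 activations
w = quantize_int8(np.random.randn(16, 16), s_w)   # int8 weights

# INT8 matrix multiply accumulates in int32 on the accelerator.
acc = a.astype(np.int32) @ w.astype(np.int32)

# Softmax needs FP32 input, so dequantize the int32 accumulator first.
p = softmax_fp32(acc.astype(np.float32) * (s_in * s_w))

# The FP32 Softmax output must be re-quantized before it can feed the
# next INT8 matrix-multiply layer -- the conversion the quote refers to.
a_next = quantize_int8(p, s_act)
```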
“…However, it assumes a very flexible Network-on-Chip (NoC) to connect the accelerators which consumes non-negligible resources and may cause large overhead because of the data congestion in the NoC. CHARM [19] composes heterogeneous accelerators for deep learning applications on ACAP. However, CHARM does not support on-chip data forwarding which results in longer inference latency.…”
Section: Hybrid Accelerators (citation type: mentioning, confidence: 99%)
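The composition idea behind CHARM can be illustrated in the abstract: instantiate a few matrix-multiply accelerators with different tile shapes and route each layer's GEMM to the one it utilizes best, since a single large accelerator wastes most of its array on small GEMMs. A hedged sketch; the tile shapes, the utilization heuristic, and the function names are assumptions for illustration, not CHARM's actual mapping algorithm:

```python
import math

# Hypothetical accelerator shapes (tile_M, tile_K, tile_N). The observation
# is that one large matmul accelerator is underutilized on the many small
# GEMMs in deep learning models, so differently sized ones are composed.
ACCELERATORS = {
    "big":   (256, 256, 256),
    "small": (32, 32, 32),
}

def padded_macs(shape, tile):
    # MACs executed once each GEMM dimension is padded to a tile multiple.
    return math.prod(math.ceil(d / t) * t for d, t in zip(shape, tile))

def assign(shape):
    # Prefer the accelerator with the best utilization (useful / padded MACs);
    # break ties toward the larger, higher-throughput accelerator.
    useful = math.prod(shape)
    return max(ACCELERATORS,
               key=lambda n: (useful / padded_macs(shape, ACCELERATORS[n]),
                              math.prod(ACCELERATORS[n])))

for gemm in [(3072, 1024, 1024), (64, 64, 512)]:   # (M, K, N)
    print(gemm, "->", assign(gemm))
```

Running this assigns the large GEMM to "big" and the (64, 64, 512) GEMM to "small", mirroring the one-size-does-not-fit-all observation that motivates composing heterogeneous accelerators.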