Yunho Oh scite author profile

Yunho Oh

5 Publications

57 Citation Statements Received

38 Citation Statements Given

Affiliations

Korea University, Yonsei University, Sungkyunkwan University

Publications

Scale-out Systolic Arrays

Yüzügüler, Sönmez, Drumond, et al. 2023

ACM Trans. Archit. Code Optim.

Multi-pod systolic arrays are emerging as the architecture of choice in DNN inference accelerators. Despite their potential, designing multi-pod systolic arrays to maximize effective throughput/Watt (i.e., throughput/Watt adjusted for array utilization) poses a unique set of challenges. In this work, we study three key pillars in multi-pod systolic array designs, namely array granularity, interconnect, and tiling. We identify optimal array granularity across workloads and show that state-of-the-art commercial accelerators use suboptimal array sizes for single-tenancy workloads. We then evaluate the bandwidth/latency trade-offs in interconnects and show that Butterfly networks offer a scalable topology for accelerators with a large number of pods. Finally, we introduce a novel data tiling scheme with custom partition size to maximize utilization in optimally sized pods. We propose Scale-out Systolic Arrays (SOSA), a multi-pod inference accelerator for both single- and multi-tenancy based on these three pillars. We show that SOSA exhibits scaling of up to 600 TeraOps/s in effective throughput for state-of-the-art DNN inference workloads, and outperforms state-of-the-art multi-pod accelerators by a factor of 1.5×.

Access Pattern-Aware Cache Management for Improving Data Utilization in GPU

Koo et al. 2017

SIGARCH Comput. Archit. News

The long latency of memory operations is a prominent performance bottleneck in graphics processing units (GPUs). The small data cache that must be shared across dozens of warps (a warp is a collection of threads) creates significant cache contention and premature data eviction. Prior works have recognized this problem and proposed warp throttling, which reduces the number of active warps contending for cache space. In this paper we discover that individual load instructions in a warp exhibit four different types of data locality behavior: (1) data brought by a warp load instruction is used only once, which is classified as streaming data; (2) data brought by a warp load is reused multiple times within the same warp, called intra-warp locality; (3) data brought by a warp is reused multiple times but across different warps, called inter-warp locality; and (4) some data exhibit a mix of intra- and inter-warp locality. Furthermore, each load instruction consistently exhibits the same locality type across all warps within a GPU kernel. Based on this discovery we argue that cache management must be done using per-load locality type information, rather than applying warp-wide cache management policies. We propose Access Pattern-aware Cache Management (APCM), which dynamically detects the locality type of each load instruction by monitoring the accesses from one exemplary warp. APCM then uses the detected locality type to selectively apply cache bypassing and cache pinning to data. Using an extensive set of simulations we show that APCM improves the performance of GPUs by 34% for cache-sensitive applications while saving 27% of energy consumption over a baseline GPU.
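The four locality types named in the abstract above can be illustrated with a small offline classifier. This is only a sketch under stated assumptions, not the APCM hardware mechanism (which, per the abstract, monitors one exemplary warp at run time): the function name `classify_load` and the trace format of `(warp_id, address)` pairs for a single load instruction are hypothetical, and addresses are assumed to be already reduced to cache-line granularity.

```python
from collections import defaultdict


def classify_load(accesses):
    """Classify one load instruction from a hypothetical access trace.

    accesses: iterable of (warp_id, address) pairs issued by that load.
    Returns "streaming", "intra-warp", "inter-warp", or "mixed".
    """
    warps_per_addr = defaultdict(set)    # address -> warps that touched it
    hits_per_pair = defaultdict(int)     # (address, warp) -> access count

    for warp_id, addr in accesses:
        warps_per_addr[addr].add(warp_id)
        hits_per_pair[(addr, warp_id)] += 1

    # Intra-warp reuse: the same warp touches the same address more than once.
    intra = any(count > 1 for count in hits_per_pair.values())
    # Inter-warp reuse: more than one warp touches the same address.
    inter = any(len(warps) > 1 for warps in warps_per_addr.values())

    if intra and inter:
        return "mixed"
    if intra:
        return "intra-warp"
    if inter:
        return "inter-warp"
    return "streaming"
```

For example, a trace in which warp 0 reads line `0x100` twice would be labeled `"intra-warp"`, while warps 0 and 1 each reading `0x100` once would be labeled `"inter-warp"`. Which type is then bypassed or pinned is not specified in the abstract, so the sketch stops at classification.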

APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs

Kim, Myung, et al. 2016

Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores

Kim, Ahn, et al. 2020

FineReg: Fine-Grained Register File Management for Augmenting GPU Throughput

Yoon, Song, et al. 2018
