Jaewoong Sim scite author profile

Tuning code for GPGPU and other emerging many-core platforms is a challenge because few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this paper, we present a performance analysis framework that can help shed light on such bottlenecks for GPGPU applications. Although a handful of GPGPU profiling tools exist, most of the traditional tools, unfortunately, simply provide programmers with a variety of measurements and metrics obtained by running applications, and it is often difficult to map these metrics to understand the root causes of slowdowns, much less decide what next optimization step to take to alleviate the bottleneck. In our approach, we first develop an analytical performance model that can precisely predict performance and aims to provide programmer-interpretable metrics. Then, we apply static and dynamic profiling to instantiate our performance model for a particular input code and show how the model can predict the potential performance benefits. We demonstrate our framework on a suite of micro-benchmarks as well as a variety of computations extracted from real codes.

A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch

Sim

Loh

Kim

et al. 2012

Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC

et al. 2016

Transparent Hardware Management of Stacked DRAM as Part of Memory

Sim

Alameldeen

Chishti

et al. 2014

Recent technology advancements allow for the integration of large memory structures on-die or as a die-stacked DRAM. Such structures provide higher bandwidth and faster access time than off-chip memory. Prior work has investigated using the large integrated memory as a cache, or using it as part of a heterogeneous memory system under management of the OS. Using this memory as a cache would waste a large fraction of total memory space, especially for the systems where stacked memory could be as large as off-chip memory. An OS-managed heterogeneous memory system, on the other hand, requires costly usage-monitoring hardware to migrate frequently-used pages, and is often unable to capture pages that are highly utilized for short periods of time. This paper proposes a practical, low-cost architectural solution to efficiently enable using large fast memory as Part-of-Memory (PoM) seamlessly, without the involvement of the OS. Our PoM architecture effectively manages two different types of memory (slow and fast) combined to create a single physical address space. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefits. Our proposed PoM architecture improves performance by 18.4% over static mapping and by 10.5% over an ideal OS-based dynamic remapping policy.

Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs

Nurvitadhi

Kwon

Jafari

et al. 2019

Jaewoong Sim

Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?

GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks

Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC

A performance analysis framework for identifying potential benefits in GPGPU applications

A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch

Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC

Transparent Hardware Management of Stacked DRAM as Part of Memory

Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs
