Irregular workloads are typically bottlenecked by the memory system. These workloads often use sparse data representations, e.g., compressed sparse row/column (CSR/CSC), to conserve space at the cost of complicated, irregular traversals. Such traversals access large volumes of data and offer little locality for caches and conventional prefetchers to exploit. This paper presents Prodigy, a low-cost hardware-software co-design solution for intelligent prefetching to improve the memory latency of several important irregular workloads. Prodigy targets irregular workloads, including graph analytics, sparse linear algebra, and fluid mechanics, that exhibit two specific types of data-dependent memory access patterns. Prodigy adopts a "best of both worlds" approach by using static program information from software and dynamic run-time information from hardware. The core of the system is the Data Indirection Graph (DIG), a proposed compact representation used to express program semantics such as the layout and memory access patterns of key data structures. The DIG representation is agnostic to a particular data structure format and is demonstrated to work with several sparse formats, including CSR and CSC. Program semantics are automatically captured with a compiler pass, encoded as a DIG, and inserted into the application binary. The DIG is then used to program a low-cost hardware prefetcher to fetch data according to an irregular algorithm's data structure traversal pattern. We equip the prefetcher with a flexible prefetching algorithm that maintains timeliness by dynamically adapting its prefetch distance to an application's execution pace. We evaluate the performance, energy consumption, and transistor cost of Prodigy using a variety of algorithms from the GAP, HPCG, and NAS benchmark suites. We compare the performance of Prodigy against a non-prefetching baseline as well as state-of-the-art prefetchers. We show that by using just 0.8KB of storage, Prodigy outperforms a non-prefetching baseline by 2.6× and saves energy by 1.6×, on average. Prodigy also outperforms modern data prefetchers by 1.5-2.3×.

Index Terms: DRAM stalls, irregular workloads, graph processing, hardware-software co-design, programming model, programmer annotations, compiler, and hardware prefetching.

[Figure: Prodigy overview. Software: compiler analysis of the application source code adds the DIG representation, producing an instrumented application binary. Hardware: the DIG programs the prefetcher, which generates prefetch requests while the application runs.]
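To make the targeted access patterns concrete, here is a minimal C sketch (ours, not code from the paper) of a CSR neighbor traversal of the kind Prodigy accelerates. The field names row_ptr, col_idx, and values are conventional CSR terminology; the comments mark the two data-dependent indirections, a ranged scan bounded by an offset array and a single-valued lookup through an index array, that a DIG would capture as nodes (the arrays) and edges (the indirections between them).

```c
#include <stddef.h>

/* Sketch of a CSR traversal (e.g., one sweep of SpMV or PageRank).
 * Every load below except row_ptr[v] is data-dependent, so a stride
 * prefetcher learns nothing useful from it. */
void csr_traverse(const int *row_ptr,    /* per-vertex edge offsets    */
                  const int *col_idx,    /* neighbor IDs, one per edge */
                  const double *values,  /* per-vertex data            */
                  double *out, size_t n_rows)
{
    for (size_t v = 0; v < n_rows; v++) {
        double acc = 0.0;
        /* Ranged indirection: row_ptr[v]..row_ptr[v+1] bounds a
         * sequential scan over a slice of col_idx. */
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
            /* Single-valued indirection: the value loaded from col_idx
             * is itself the index of the next load into values. */
            acc += values[col_idx[e]];
        }
        out[v] = acc;
    }
}
```

A prefetcher programmed with this structure can walk row_ptr ahead of the core, fetch the corresponding col_idx slices, and then fetch the values elements they point to, which is the traversal-aware behavior the abstract describes.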
Training convolutional neural networks (CNNs) requires intense computation and high memory bandwidth. We find that bandwidth today is over-provisioned because most memory accesses in CNN training can be eliminated by rearranging computation to better utilize on-chip buffers and avoid the traffic caused by large per-layer memory footprints. We introduce the MBS CNN training approach, which significantly reduces memory traffic by partially serializing mini-batch processing across groups of layers. This optimizes reuse within on-chip buffers and balances intra-layer and inter-layer reuse. We also introduce the WaveCore CNN training accelerator, which efficiently trains CNNs using the MBS approach while sustaining high functional-unit utilization. Combined, WaveCore and MBS reduce DRAM traffic by 75%, improve performance by 53%, and save 26% of system energy for modern deep CNN training compared to conventional training mechanisms and accelerators.
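The traffic reduction can be read as a loop-ordering change. The following C sketch is our illustration, not WaveCore's implementation: the sub-batch size, layer-group size, and forward_layer stub are hypothetical, and the real scheme makes further choices (e.g., how layers are grouped) that this omits. Conventional training streams the whole mini-batch through each layer, so each layer's activations spill to DRAM; MBS-style partial serialization runs one sub-batch through a whole group of layers before the next sub-batch starts, so that sub-batch's activations can stay resident in on-chip buffers.

```c
/* Contrast of loop orders; forward_layer is a hypothetical stand-in
 * for one layer's computation on one sample. */
#define MINI_BATCH   64   /* samples per mini-batch (assumed)          */
#define SUB_BATCH     8   /* sized so a sub-batch's activations fit on */
                          /* chip; a tuning knob in this sketch        */
#define GROUP_LAYERS  4   /* layers serialized as one group (assumed)  */

static void forward_layer(int layer, int sample)
{
    (void)layer; (void)sample;   /* stub for the actual layer kernel */
}

/* Conventional, layer-major order: each layer touches the activations
 * of the entire mini-batch, so its footprint rarely fits on chip. */
void forward_conventional(int n_layers)
{
    for (int l = 0; l < n_layers; l++)
        for (int s = 0; s < MINI_BATCH; s++)
            forward_layer(l, s);
}

/* MBS-style partial serialization: a sub-batch is carried through a
 * whole layer group, so the activations it produces between layers of
 * the group are consumed on chip rather than written back to DRAM. */
void forward_mbs(int n_layers)
{
    for (int g = 0; g < n_layers; g += GROUP_LAYERS)
        for (int b = 0; b < MINI_BATCH; b += SUB_BATCH)
            for (int l = g; l < g + GROUP_LAYERS && l < n_layers; l++)
                for (int s = b; s < b + SUB_BATCH; s++)
                    forward_layer(l, s);
}
```

Shrinking the sub-batch reduces the activation footprint but means each layer's weights are re-read once per sub-batch, which is the intra-layer versus inter-layer reuse balance the abstract mentions.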