Anant Nori scite author profile

Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address, or delta between cacheline addresses) to predict future memory accesses. These techniques either completely neglect a prefetcher's undesirable effects (e.g., memory bandwidth usage) on the overall system, or incorporate system-level feedback as an afterthought to a system-unaware prefetch algorithm. We show that prior prefetchers often lose their performance benefit over a wide range of workloads and system configurations due to their inherent inability to take multiple different types of program context and system-level feedback information into account while prefetching. In this paper, we make a case for designing a holistic prefetch algorithm that learns to prefetch using multiple different types of program context and system-level feedback information inherent to its design.To this end, we propose Pythia, which formulates the prefetcher as a reinforcement learning agent. For every demand request, Pythia observes multiple different types of program context information to make a prefetch decision. For every prefetch decision, Pythia receives a numerical reward that evaluates prefetch quality under the current memory bandwidth usage. Pythia uses this reward to reinforce the correlation between program context information and prefetch decision to generate highly accurate, timely, and systemaware prefetch requests in the future. Our extensive evaluations using simulation and hardware synthesis show that Pythia outperforms two state-of-the-art prefetchers (MLOP and Bingo) by 3.4% and 3.8% in single-core, 7.7% and 9.6% in twelve-core, and 16.9% and 20.2% in bandwidth-constrained core configurations, while incurring only 1.03% area overhead over a desktop-class processor and no software changes in workloads. The source code of Pythia can be freely downloaded from https://github.com/CMU-SAFARI/Pythia.

show abstract

Criticality Aware Tiered Cache Hierarchy: A Fundamental Relook at Multi-Level Cache Hierarchies

Nori¹,

Gaur

Rai

et al. 2018

View full text Add to dashboard Cite

pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

Ferreira¹,

Falcão

Gómez-Luna³

et al. 2022

View full text Add to dashboard Cite

REDUCT: Keep it Close, Keep it Cool! : Efficient Scaling of DNN Inference on Multi-core CPUs with Near-Cache Compute

Nori

Bera

Balachandran

et al. 2021

View full text Add to dashboard Cite

DSPatch

Bera

Nori

Mutlu

et al. 2019

View full text Add to dashboard Cite

Cryptographic Capability Computing

LeMay

Rakshit

Deutsch

et al. 2021

View full text Add to dashboard Cite

TransforMAP: Transformer for Memory Access Prediction

Zhang¹,

Srivastava²,

Nori³

et al. 2022

Preprint

View full text Add to dashboard Cite

Data Prefetching is a technique that can hide memory latency by fetching data before it is needed by a program. Prefetching relies on accurate memory access prediction, to which task machine learning based methods are increasingly applied. Unlike previous approaches that learn from deltas or offsets and perform one access prediction, we develop TransforMAP, based on the powerful Transformer model, that can learn from the whole address space and perform multiple cache line predictions. We propose to use the binary of memory addresses as model input, which avoids information loss and saves a token table in hardware. We design a block index bitmap to collect unordered future page offsets under the current page address as learning labels. As a result, our model can learn temporal patterns as well as spatial patterns within a page. In a practical implementation, this approach has the potential to hide prediction latency because it prefetches multiple cache lines likely to be used in a long horizon. We show that our approach achieves 35.67% MPKI improvement and 20.55% IPC improvement in simulation, higher than state-of-the-art Best-Offset prefetcher and ISB prefetcher.

show abstract

pLUTo: Enabling Massively Parallel Computation In DRAM via Lookup Tables

Ferreira¹,

Falcão²,

Gómez-Luna³

et al. 2021

Preprint

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Anant Nori

Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

Criticality Aware Tiered Cache Hierarchy: A Fundamental Relook at Multi-Level Cache Hierarchies

pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

REDUCT: Keep it Close, Keep it Cool! : Efficient Scaling of DNN Inference on Multi-core CPUs with Near-Cache Compute

DSPatch

Cryptographic Capability Computing

TransforMAP: Transformer for Memory Access Prediction

pLUTo: Enabling Massively Parallel Computation In DRAM via Lookup Tables

Contact Info

Product

Resources

About