2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca51647.2021.00061
Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design

Abstract: Irregular workloads are typically bottlenecked by the memory system. These workloads often use sparse data representations, e.g., compressed sparse row/column (CSR/CSC), to conserve space at the cost of complicated, irregular traversals. Such traversals access large volumes of data and offer little locality for caches and conventional prefetchers to exploit. This paper presents Prodigy, a low-cost hardware-software co-design solution for intelligent prefetching to improve the memory latency of several important…
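To make the access pattern concrete, below is a minimal sketch in C of the data-indirect traversal the abstract describes: a CSR graph walk in which each load depends on the value of a previous load, so stride-based caches and conventional prefetchers see little to exploit. All names here are illustrative assumptions, not taken from the paper.

/* Illustrative CSR graph layout (names are assumptions). */
#include <stdint.h>

typedef struct {
    uint32_t  num_vertices;
    uint32_t *row_offsets;  /* size num_vertices + 1: start of each edge list */
    uint32_t *col_indices;  /* size num_edges: neighbor vertex IDs */
    float    *vertex_data;  /* per-vertex property array */
} csr_graph_t;

/* Sum the data of u's neighbors. The load of vertex_data[v] is
 * indirect through col_indices, which is itself indexed through
 * row_offsets: a two-deep indirection chain with little locality. */
float sum_neighbor_data(const csr_graph_t *g, uint32_t u) {
    float acc = 0.0f;
    for (uint32_t e = g->row_offsets[u]; e < g->row_offsets[u + 1]; e++) {
        uint32_t v = g->col_indices[e];  /* irregular index */
        acc += g->vertex_data[v];        /* data-indirect load */
    }
    return acc;
}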

Cited by 35 publications (17 citation statements) | References 94 publications (125 reference statements)
“…Optimizing Irregular Memory Accesses Recent work has made significant strides in domain-agnostic prefetching for irregular applications [10,43,57]. Our split-tree structure can be seen as an application-specific prefetcher and achieves "perfect prefetching" in that 1) off-chip data accesses are overlapped with computation, 2) data needed by the accelerator are readily available on-chip without stalls, and 3) no redundant DRAM accesses are needed.…”
Section: Related Work (mentioning)
confidence: 99%
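As a rough illustration of the first property, overlapping off-chip accesses with computation, here is a hedged sketch of generic software prefetching applied to the CSR traversal above. It reuses the illustrative csr_graph_t, assumes a GCC/Clang-style __builtin_prefetch, and uses an arbitrary lookahead distance; it shows the general overlap technique, not the cited split-tree design.

#define PREFETCH_DISTANCE 8  /* illustrative lookahead, tune per machine */

float sum_with_prefetch(const csr_graph_t *g, uint32_t u) {
    uint32_t start = g->row_offsets[u];
    uint32_t end   = g->row_offsets[u + 1];
    float acc = 0.0f;
    for (uint32_t e = start; e < end; e++) {
        if (e + PREFETCH_DISTANCE < end) {
            /* issue the indirect prefetch early so DRAM latency is
             * hidden behind work on the current neighbors */
            __builtin_prefetch(&g->vertex_data[g->col_indices[e + PREFETCH_DISTANCE]]);
        }
        acc += g->vertex_data[g->col_indices[e]];
    }
    return acc;
}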
“…Prior work has studied many optimizations that transform data as it moves through the cache, e.g., to compress [9,36,90,106,107,118,136,146], decrypt [47,65,115], prefetch [6,131,149], change layout [7,23], memoize [8,40,153,154], or serialize/de-serialize [108] data. We motivate täkō by observing how its onMiss callback enables arbitrary data transformations while improving performance, saving energy, and reducing overall work.…”
Section: Example Program: Lossy Compression (mentioning)
confidence: 99%
“…Specialized cache hierarchies. These trends have been widely recognized, and there are many proposals to accelerate data movement, e.g., in machine learning [2,50], graph analytics [92,95,150], data structures [54,58,154], memoization [8,40,153,154], compression [9,36,90,106,107,118,136,146], data layout [7,23,155], prefetching [6,131,149], coherence and synchronization [34,75,151,152], memory management [85,135], and system software [67,108,127]. While highly effective, they share the drawback of requiring custom hardware.…”
Section: Related Work (mentioning)
confidence: 99%
“…Resolving load-dependent control flow operations at NDP precludes the need for using expensive branch resolution mechanisms on the CPU. Moreover, irregular accesses to graph data structures resulting in high memory latency and/or bandwidth [33,54] can be better serviced near memory at a low latency and high available bandwidth, addressing the two main bottlenecks in GPM workloads (Takeaway 3). In summary, NDP is an attractive candidate for accelerating GPM workloads.…”
Section: Why NDP for GPM? (mentioning)
confidence: 99%