Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture 2019
DOI: 10.1145/3352460.3358300
Temporal Prefetching Without the Off-Chip Metadata

Cited by 33 publications (17 citation statements); References 28 publications.
“…Thus, cold start mitigation strategies should prioritize unloading functions from memory when they are less likely to be invoked, rather than unloading them only when other functions need space. Hence, as demonstrated by other approaches to managing variable-sized caches, Time-to-Live caches cannot be applied [4–6, 52, 69]. Also, traditional caching algorithms depend on the hit and miss ratios of all objects [13, 17], whereas such centralized control will not scale well for serverless functions.…”
Section: Related Work (mentioning)
confidence: 99%
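As a rough illustration of the contrast drawn in the excerpt above, the C++ sketch below compares a fixed Time-to-Live eviction check against an unload decision driven by a predicted invocation likelihood. The types, names, and thresholds are hypothetical and are not taken from any of the cited systems:

#include <chrono>

using Clock = std::chrono::steady_clock;

struct LoadedFunction {
    Clock::time_point last_invoked;
    double invoke_probability;  // assumed output of a per-function predictor
};

// Fixed TTL: unload anything idle longer than the TTL, regardless of how
// likely it is to be invoked again.
bool ttl_should_unload(const LoadedFunction& f, Clock::time_point now,
                       std::chrono::seconds ttl) {
    return now - f.last_invoked > ttl;
}

// Likelihood-based: unload when the function is unlikely to be invoked soon,
// without waiting for other functions to need the space.
bool likelihood_should_unload(const LoadedFunction& f, double threshold) {
    return f.invoke_probability < threshold;
}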
“…Temporal prefetchers usually demand hundreds of KBs of metadata storage, which forces the prefetch metadata to be kept in off-chip memory. Some recent works on temporal prefetching pursue reducing this storage overhead without hurting prefetch coverage [58], [59]. Berti, on the other hand, incurs a storage overhead of just 2.55 KB per core.…”
Section: Related Work (mentioning)
confidence: 99%
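To make the storage concern above concrete, here is a minimal C++ sketch of the kind of metadata a conventional temporal prefetcher maintains: a global history of past miss addresses plus an index from each address to its most recent position in that history, so that a new miss can replay the addresses that followed it last time. The structure grows with the observed miss stream, which is why such metadata reaches hundreds of KBs and is typically pushed off-chip. Names, sizes, and the replay policy here are illustrative only and are not the design of the paper under discussion, which specifically targets removing the off-chip metadata:

#include <cstdint>
#include <unordered_map>
#include <vector>

struct TemporalMetadata {
    std::vector<uint64_t> history;               // global stream of past miss addresses
    std::unordered_map<uint64_t, size_t> index;  // last position of each address in the stream

    void record_miss(uint64_t addr) {
        index[addr] = history.size();
        history.push_back(addr);
    }

    // On a miss to `addr`, replay the addresses that followed its last occurrence.
    std::vector<uint64_t> predict(uint64_t addr, size_t degree) const {
        std::vector<uint64_t> out;
        auto it = index.find(addr);
        if (it == index.end()) return out;
        for (size_t i = it->second + 1; i < history.size() && out.size() < degree; ++i)
            out.push_back(history[i]);
        return out;
    }
};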
“…Accelerating irregular workloads with hardware prefetchers [37], [54]–[56], [73], [77], [95], [99] has long been studied, covering other types of data structures and memory access patterns, such as linked lists, binary trees, and hash joins, in application domains such as geometric and scientific computation, high-performance computing, and databases. Furthermore, several temporal prefetchers [46], [93], [95], [96] and non-temporal prefetchers [13], [17], [52], [53], [64], [82], [86] have also been investigated for these workloads. When applied in the graph processing context, however, these approaches either prefetch for only a subset of data structures or incur high complexity and cost for generality.…”
Section: Related Work (mentioning)
confidence: 99%
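For context on why such irregular access patterns are hard for conventional prefetchers, the short C++ sketch below (an assumed example, not drawn from the cited works) shows a linked-list traversal: each node address is data-dependent, so a stride prefetcher sees no regular pattern, yet the same sequence of miss addresses recurs on every traversal, which is exactly the temporal correlation that temporal prefetchers exploit:

struct Node {
    int value;
    Node* next;
};

long sum_list(const Node* head) {
    long sum = 0;
    // Each dereference of `n->next` can miss at an arbitrary address; the order
    // of those addresses, however, is identical across repeated traversals.
    for (const Node* n = head; n != nullptr; n = n->next)
        sum += n->value;
    return sum;
}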