2002
DOI: 10.1145/773039.773043

Calculating stack distances efficiently

Abstract: This paper describes our experience using the stack processing algorithm [6] for estimating the number of cache misses in scientific programs. By using a new data structure and various optimization techniques, we obtain instrumented run-times within 50 to 100 times the original optimized run-times of our benchmarks.
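The stack distance at the heart of the algorithm is the number of distinct addresses referenced since the previous access to the same address, i.e. the depth of that address in an LRU stack; a first access has infinite distance. A minimal Python sketch of this definition (the naive quadratic version, not the paper's optimized data structure, which is what makes the reported 50 to 100x slowdowns achievable) might look like:

```python
def stack_distances(trace):
    """Return the LRU stack distance of each access in `trace`."""
    stack = []   # LRU stack of distinct addresses, most recent last
    dists = []
    for addr in trace:
        if addr in stack:
            # Depth below the top of the stack = number of distinct
            # addresses touched since the last access to `addr`.
            depth = len(stack) - 1 - stack.index(addr)
            dists.append(depth)
            stack.remove(addr)
        else:
            # Cold (first) access: infinite stack distance.
            dists.append(float("inf"))
        stack.append(addr)  # `addr` becomes most recently used
    return dists
```

For example, `stack_distances(['a', 'b', 'a', 'b', 'a'])` yields two infinite distances for the cold accesses followed by distance 1 for each reuse. The naive scan of the stack is what the paper's new data structure replaces to make the computation efficient.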

Cited by 53 publications (74 citation statements)
References 7 publications
“…To derive the number of cache misses, we will later use stack distance histograms [4]- [7] within our performance estimation framework. These stack distance histograms only depend on the number of cache sets and the cache line size (which we assume to be fixed).…”
Section: B. Stack Distance Histogram Computation
confidence: 99%
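As the quoted passage notes, one stack distance histogram determines the miss count for every cache size: under LRU, an access hits in a fully associative cache of C lines exactly when its stack distance is less than C. A small illustrative sketch of that property (an assumption-laden example, not code from the cited works):

```python
from collections import Counter

def lru_misses_from_histogram(hist, cache_lines):
    # Under LRU, an access hits in a fully associative cache of
    # `cache_lines` lines iff its stack distance < cache_lines, so a
    # single histogram yields the miss count for every cache size.
    return sum(count for dist, count in hist.items() if dist >= cache_lines)

# Example histogram for 5 accesses: two cold (infinite-distance)
# misses, two reuses at distance 1, one reuse at distance 3.
hist = Counter({1: 2, 3: 1, float("inf"): 2})
```

With this histogram, a 2-line cache misses on 3 of the 5 accesses, while a 4-line cache misses only on the 2 cold accesses; no re-simulation of the trace is needed per configuration.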
“…Estimation techniques based on the stack distance [3] provide more accuracy, but suffer from the memory overhead. Several methods have been proposed to enable an efficient computation [4]- [7] or approximation [8] of the stack distance. Stack distances and cache miss equations, however, only provide the number of misses and hits for different cache configurations.…”
Section: Introduction
confidence: 99%
“…Computation to determine the distribution can be reduced with efficient algorithms [5] or by approximate analysis [49]. Shi, et al [38] perform single-pass stack simulation to project cache performance and to study the impact of data replication for various L2 cache configurations.…”
Section: Related Work
confidence: 99%
“…As derived in Section II-D, our temporal locality essentially represents the same information as reuse distance histograms (cumulative distribution function vs. probability distribution function). Although we can leverage the previous works on fast computation of reuse distances [1], we aim to reduce the computation time by an order of magnitude to make it practical for compiler or runtime profile analysis. Based on the inherent data-level parallelism of our locality computation, we resort to parallel computation on graphics processing units (GPUs) [21].…”
Section: A. GPU-Based Parallel Algorithm for Locality Computation
confidence: 99%