Predicting whole-program locality through reuse distance analysis

Ding, Chen; Zhong, Yutao

doi:10.1145/781158.781159

Cited by 76 publications

(157 citation statements)

References 24 publications

Supporting

Mentioning

156

Contrasting

“…One key difference between the work in this area and our work is in the metrics used for transformations. Cache transformations are based on metrics like reuse distance [12] or stack distance [7]. In comparison, we target programmer controlled memory hierarchy levels, and applications where we primarily see capacity misses.…”

Section: Related Work (mentioning)

confidence: 99%

Practical Loop Transformations for Tensor Contraction Expressions on Multi-level Memory Hierarchies

Krishnamoorthy

Agrawal

2011

Lecture Notes in Computer Science

Abstract. Modern architectures are characterized by deeper levels of memory hierarchy, often explicitly addressable. Optimizing applications for such architectures requires careful management of the data movement across all these levels. In this paper, we focus on the problem of mapping tensor contractions to memory hierarchies with more than two levels, specifically addressing placement of memory allocation and data movement statements, choice of loop fusions, and tile size selection. Existing algorithms to find an integrated solution to this problem even for two-level memory hierarchies have been shown to be expensive. We improve upon this work by focusing on the first-order cost components, simplifying the analysis required and reducing the number of candidates to be evaluated. We have evaluated our framework on a cluster of GPUs. Using five candidate tensor contraction expressions, we show that fusion at multiple levels improves performance, and our framework is effective in determining profitable transformations.
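The reuse distance metric named in the citation statement above can be illustrated with a minimal sketch. It assumes the common definition (the number of distinct data elements accessed between two consecutive accesses to the same element) and uses a simple, quadratic-time LRU stack. The function names `reuse_distances` and `lru_hits` are hypothetical and are not taken from any of the cited papers.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    The distance is the number of distinct elements accessed since the
    previous access to the same element; first accesses get None.
    This is an illustrative O(N*M) sketch, not an efficient analyzer.
    """
    stack = OrderedDict()  # LRU stack: least recent first, most recent last
    distances = []
    for item in trace:
        if item in stack:
            keys = list(stack)
            position = keys.index(item)
            # Everything after `item` in the stack was touched since its last use.
            distances.append(len(keys) - position - 1)
            stack.move_to_end(item)
        else:
            distances.append(None)
            stack[item] = True
    return distances


def lru_hits(trace, capacity):
    """Assumed model: a fully associative LRU cache holding `capacity` elements.

    An access hits exactly when its reuse distance is smaller than the capacity.
    """
    return [d is not None and d < capacity for d in reuse_distances(trace)]


if __name__ == "__main__":
    print(reuse_distances(list("abcab")))
    print(lru_hits(list("abcab"), capacity=2))
```

Under the assumed fully associative LRU model, long reuse distances map directly to capacity misses, which is why the metric is useful for cache-oriented transformations.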

“…However, for an important category of cache, namely those of random replacement policy, as the replacement policy randomly determines one among multiple candidates for eviction, the naïve simulation only constitutes a Monte Carlo simulation, therefore can only give one out of many possible results. In particular, in a single round, naïve simulation cannot give the hit probability of each cache reference, which is of particular interest in program analysis [3] [4]. If we maintain n copies of possible cache states and simulate access sequences on these states simultaneously, then the time and space requirement of the simulation will be equal to running n copies of naïve simulation in parallel, with no gain in efficiency.…”

Section: Introduction (mentioning)

confidence: 99%

“…2 Even if the replacement algorithm takes care to not evict valid data when there are free slots, as the cache is soon filled up with valid data, there will be no difference in practice. 3 Sometimes referred to as pseudo random replacement policy due to difficulty, if not impossibility, of obtaining true randomness.…”

Section: Introduction (mentioning)

confidence: 99%

An Efficient Simulation Algorithm for Cache of Random Replacement Policy

Zhou

2010

Lecture Notes in Computer Science

Abstract. Cache is employed to exploit the phenomena of locality in many modern computer systems. One way of evaluating the impact of cache is to run a simulator on traces collected from realistic work load. However, for an important category of cache, namely those of random replacement policy, each round of the naïve simulation can only give one out of many possible results, therefore requiring many rounds of simulation to capture the cache behavior, like determining the hit probability of a particular cache reference. In this paper, we present an algorithm that efficiently approximates the hit probability in linear time with moderate space in a single round. Our algorithm is applicable to realistic processor cache parameters where the associativity is typically low, and extends to cache of large associativity. Experiments show that in one round, our algorithm collects information that would previously require up to dozens of rounds of simulation.
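The naive multi-round simulation that the abstract and citation statements above contrast against can be sketched as follows. This is only the Monte Carlo baseline, not the paper's single-round algorithm. It assumes a fully associative cache that fills free slots before evicting, and the names `simulate_random_cache` and `estimate_hit_probability` are hypothetical.

```python
import random


def simulate_random_cache(trace, capacity, rng):
    """Run one round of naive simulation and return a hit/miss flag per access.

    Assumptions: fully associative cache, free slots are filled first,
    and the victim is chosen uniformly at random once the cache is full.
    """
    cache = []
    hits = []
    for item in trace:
        if item in cache:
            hits.append(True)
        else:
            hits.append(False)
            if len(cache) < capacity:
                cache.append(item)
            else:
                cache[rng.randrange(capacity)] = item  # random eviction
    return hits


def estimate_hit_probability(trace, capacity, rounds, seed=0):
    """Estimate each access's hit probability by averaging many naive rounds."""
    rng = random.Random(seed)
    counts = [0] * len(trace)
    for _ in range(rounds):
        for i, hit in enumerate(simulate_random_cache(trace, capacity, rng)):
            counts[i] += hit
    return [count / rounds for count in counts]


if __name__ == "__main__":
    # The last access to "a" should hit in roughly half of the rounds.
    print(estimate_hit_probability(list("abca"), capacity=2, rounds=2000))
```

A single round yields only one of many possible outcomes, so the per-access estimate becomes meaningful only after many rounds. That repeated cost is what the citing paper aims to avoid.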

“…Some studied more structured programs and finer grained data-data reuses within and across loop nests [28], per-statement [29], and across program inputs [14]. These studies show that long-distance data reuses cause cache misses, but they do not show how well we can improve the locality of data access.…”

Section: Related Work (mentioning)

confidence: 99%

“…A trace may not represent the program behavior on other inputs, and a trace may be too large to be analyzed. For many programs, earlier work has shown that the temporal locality follows a predictable pattern and the (cache miss) behavior of all program inputs can be predicted by examining medium-size training runs [14,15,26,34,44]. In this paper, we use a medium-size input for each program.…”

Section: Introduction (mentioning)

confidence: 99%

The Potential of Computation Regrouping for Improving Locality

Ding

Orlovich

Proceedings of the ACM/IEEE SC2004 Conference

Improving program locality has become increasingly important on modern computer systems. An effective strategy is to group computations on the same data so that once the data are loaded into cache, the program performs all their operations before the data are evicted. However, computation regrouping is difficult to automate for programs with complex data and control structures. This paper studies the potential of locality improvement through trace-driven computation regrouping. First, it shows that maximizing the locality is different from maximizing the parallelism or maximizing the cache utilization. The problem is NP-hard even without considering data dependences and cache organization. Then the paper describes a tool that performs constrained computation regrouping on program traces. The new tool is unique because it measures the exact control dependences and applies complete memory renaming and re-allocation. Using the tool, the paper measures the potential locality improvement in a set of commonly used benchmark programs written in C.
