Proceedings of the 1999 ACM/IEEE Conference on Supercomputing 1999
DOI: 10.1145/331532.331534
Locality optimizations for multi-level caches

Abstract: Compiler transformations can significantly improve data locality of scientific programs. In this paper, we examine the impact of multi-level caches on data locality optimizations. We find nearly all the benefits can be achieved by simply targeting the L1 (primary) cache. Most locality transformations are unaffected because they improve reuse for all levels of the cache; however, some optimizations can be enhanced. Inter-variable padding can take advantage of modular arithmetic to eliminate conflict misses and …
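The abstract's reference to inter-variable padding can be made concrete with a small sketch. The C fragment below is not from the paper; the array length, cache size, and line size are assumptions chosen so that the two unpadded arrays would map to the same sets of a direct-mapped L1 (assuming the compiler lays the globals out contiguously), and a one-line pad between them breaks the conflict.

/* Hypothetical sketch of inter-variable padding, not the paper's algorithm.
 * If a[] and b[] are laid out back to back and the size of a[] is a multiple
 * of the cache size, a[i] and b[i] map to the same set of a direct-mapped
 * cache and evict each other on every iteration (conflict misses). Inserting
 * a pad of one cache line shifts b[]'s starting address so the two streams
 * fall into different sets.
 */
#define N        4096          /* assumed array length                    */
#define CACHE_SZ (32 * 1024)   /* assumed direct-mapped L1 size in bytes  */
#define LINE_SZ  32            /* assumed cache line size in bytes        */

_Static_assert(N * sizeof(double) % CACHE_SZ == 0,
               "chosen so the unpadded arrays conflict");

double a[N];
double pad[LINE_SZ / sizeof(double)];  /* inter-variable pad: one line     */
double b[N];

void add(double *c)
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];    /* with the pad, a[i] and b[i] no longer
                                  share a cache set                        */
}

In practice the pad size would be chosen by the compiler using modular arithmetic on the variables' base addresses, as the abstract describes; the one-line pad here is only the simplest illustrative choice.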

Cited by 32 publications (28 citation statements)
References 37 publications (52 reference statements)
“…We have considered a different architecture, and our conclusions are different. Rivera and Tseng examined loop transformations for multi-level caches, finding that all performance gains can be achieved by simply focusing on the L1 cache [32]. Clearly, as we have considered architectures with a different type of multi-level memory hierarchy, our conclusions are different.…”
Section: Related Work
confidence: 91%
“…Similarly, there has been some work on optimizing data movements from main memory to device memory [35,34,22]. As multi-level processor caches became very common in the mid-nineties, several compiler efforts considered optimizations for them [25,32,30].…”
Section: Introduction
confidence: 99%
“…Chame and Moon [8] developed techniques to minimize the sum of the capacity and cross-interference misses while avoiding self-interference misses. Rivera and Tseng [26] developed padding techniques to reduce interference misses and studied the effect of multi-level caches on data locality optimizations. Hsu and Kremer [16] presented a comprehensive comparative study of tile size selection algorithms.…”
Section: Related Work
confidence: 99%
“…In analytical approaches, a compiler selects tile sizes based on static analysis of loop nests and known characteristics of the memory hierarchy. Although several analytical techniques for tile size selection have been proposed in the literature [8,10,13,16,19,26,27,28], none has been demonstrated to be sufficiently effective for use in practice. As a result, the gap between the performance delivered by the best known tile sizes and those selected by an analytical approach has continued to widen, thereby diminishing the utility of past analytical approaches.…”
Section: Introduction
confidence: 99%
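Several of the citing works above concern tile size selection. As a point of reference, the C sketch below shows the kind of loop tiling those selection schemes parameterize; the matrix size N and tile size T are illustrative assumptions, not values from any cited paper, and N is chosen divisible by T to keep the sketch free of boundary handling.

/* Illustrative loop tiling (blocking) of matrix multiply, the transformation
 * that tile-size-selection work targets. Analytical models derive T from
 * cache parameters, e.g. roughly 3*T*T*sizeof(double) <= L1 capacity so the
 * working set of one tile stays resident; T = 32 below is only a placeholder.
 */
#define N 1024
#define T 32                              /* assumed tile size */

void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                /* work on one T x T tile so reused data stays in cache */
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++)
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += A[i][k] * B[k][j];
}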
“…In order to quantify the benefits of adopting nonlinear layouts to reduce cache misses, there exist several different approaches. In [18], Rivera et al. consider all levels of the memory hierarchy, reducing L2 cache misses as well rather than only L1 misses. They report even fewer overall misses; however, the performance improvements are rarely significant.…”
Section: Related Work
confidence: 99%
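For context, a nonlinear layout of the kind discussed in the statement above can be sketched as a block-major index function; the block size B and the helper name blk_index are illustrative assumptions rather than the layout used in [18].

/* Rough sketch of a nonlinear (block-major / tiled) array layout. Elements
 * are stored block by block so that each B x B tile occupies contiguous
 * memory, which keeps a tile's accesses within a small, conflict-free
 * region of the cache.
 */
#include <stddef.h>

#define N 1024
#define B 32                               /* assumed block size */

static inline size_t blk_index(size_t i, size_t j)
{
    size_t bi = i / B, bj = j / B;         /* which block             */
    size_t oi = i % B, oj = j % B;         /* offset inside the block */
    size_t blocks_per_row = N / B;
    return ((bi * blocks_per_row + bj) * B + oi) * B + oj;
}

/* data[blk_index(i, j)] replaces the row-major access data[i * N + j]. */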