Eliminating conflict misses for high performance architectures

Rivera, Gabriel; Tseng, Chau‐Wen

doi:10.1145/277830.277917

Cited by 40 publications

(32 citation statements)

References 22 publications


“…Instead, most optimizations have focused on exploiting temporal and spatial reuse within individual loop nests [21,33]. Tiling is usually not needed, since most locality can be obtained through loop permutation, though in some cases array padding may be necessary to preserve group reuse [25].…”

Section: Tiling For Stencil Codes (mentioning)

confidence: 99%

Tiling Optimizations for 3D Scientific Computations

Rivera

Tseng

2000

ACM/IEEE SC 2000 Conference (SC'00)


Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
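The tiling transformation described in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: the grid contents, the six-neighbour update, the choice to tile the two inner dimensions while sweeping the full third one, and all function names (`make_grid`, `stencil_untiled`, `stencil_tiled`) are assumptions made for illustration. Python lists do not reproduce real cache behaviour, so the sketch shows only that the tiled loop order visits the same points and produces the same result.

```python
def make_grid(n):
    """Build a deterministic n*n*n grid as nested lists, indexed a[k][j][i]."""
    return [[[float((7 * k + 3 * j + i) % 11) for i in range(n)]
             for j in range(n)]
            for k in range(n)]


def average_neighbours(a, k, j, i):
    """Mean of the six face neighbours of interior point (k, j, i)."""
    return (a[k - 1][j][i] + a[k + 1][j][i] +
            a[k][j - 1][i] + a[k][j + 1][i] +
            a[k][j][i - 1] + a[k][j][i + 1]) / 6.0


def stencil_untiled(a, n):
    """Plain k/j/i sweep over the interior of the grid."""
    b = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for k in range(1, n - 1):
        for j in range(1, n - 1):
            for i in range(1, n - 1):
                b[k][j][i] = average_neighbours(a, k, j, i)
    return b


def stencil_tiled(a, n, tile_j, tile_i):
    """Same computation, with the j and i loops split into tiles.

    For each (tile_j x tile_i) block, the k loop sweeps the whole third
    dimension, so only a small slab of each plane is live at a time.
    """
    b = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for jj in range(1, n - 1, tile_j):          # tile origin in j
        for ii in range(1, n - 1, tile_i):      # tile origin in i
            for k in range(1, n - 1):
                for j in range(jj, min(jj + tile_j, n - 1)):
                    for i in range(ii, min(ii + tile_i, n - 1)):
                        b[k][j][i] = average_neighbours(a, k, j, i)
    return b


grid = make_grid(10)
same = stencil_tiled(grid, 10, tile_j=4, tile_i=3) == stencil_untiled(grid, 10)
```

In a compiled language the tile dimensions would be chosen so that the live slab fits in cache without self-conflicts, which is the selection problem the abstract refers to.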



“…Instability comes from the so-called pathological array sizes, when array dimensions are near powers of two, since cache interference is a particular risk at that point. Array padding [8], [13], [16] is a compiler optimization that increases the array sizes and changes initial locations to avoid pathological cases. It introduces space overhead but effectively stabilizes program performance.…”

Section: Related Work (mentioning)

confidence: 99%
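The effect of pathological array sizes mentioned in the statement above can be sketched with a toy cache model. Everything here is an assumption for illustration: a direct-mapped cache with 256 lines of 4 elements each, a column-major array, and the hypothetical helpers `cache_line_index` and `distinct_lines_for_row`. The sketch counts how many distinct cache lines are used when one row is read across several columns.

```python
def cache_line_index(elem_offset, line_elems, num_lines):
    """Line of a direct-mapped cache that holds the element at elem_offset."""
    return (elem_offset // line_elems) % num_lines


def distinct_lines_for_row(leading_dim, cols, line_elems=4, num_lines=256):
    """Count distinct cache lines touched by elements (0, j), j = 0..cols-1.

    The array is assumed column-major, so element (0, j) sits at offset
    j * leading_dim. Fewer distinct lines means more conflict misses.
    """
    return len({cache_line_index(j * leading_dim, line_elems, num_lines)
                for j in range(cols)})


# Leading dimension equal to the cache capacity (1024 elements):
# every column start maps to the same line, so the accesses evict each other.
unpadded = distinct_lines_for_row(1024, 8)

# Padding the leading dimension by one cache line (4 elements)
# spreads the column starts over different lines.
padded = distinct_lines_for_row(1028, 8)
```

The pad costs four unused elements per column, which is the space overhead the statement refers to.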

“…Although the results are accurate, the time needed to obtain them is typically many times greater than the total execution time of the program being simulated. To try to overcome such problems, analytical models of cache behaviour combined with heuristics have also been developed, to guide optimizing compilers [6], [16] and [23], or study the cache performance of particular types of algorithm, especially blocked ones [3], [7], [10], and [22]. Code optimizations, such as tile size selection, selected with the help of predicted miss ratios require a really accurate assessment of program's code behaviour.…”

Section: Related Work (mentioning)

confidence: 99%

Tuning Blocked Array Layouts to Exploit Memory Hierarchy in SMT Architectures

Athanasaki

Kourtis

Anastopoulos

et al. 2005

Advances in Informatics


Abstract. Cache misses form a major bottleneck for memory-intensive applications, due to the significant latency of main memory accesses. Loop tiling, in conjunction with other program transformations, has been shown to be an effective approach to improving locality and cache exploitation, especially for dense matrix scientific computations. Beyond loop nest optimizations, data transformation techniques, and in particular blocked data layouts, have been used to boost the cache performance. The stability of the performance improvements achieved is heavily dependent on the appropriate selection of tile sizes. In this paper, we investigate the memory performance of blocked data layouts, and provide a theoretical analysis for the multiple levels of memory hierarchy, when they are organized in a set associative fashion. According to this analysis, the optimal tile size that maximizes L1 cache utilization should completely fit in the L1 cache, even for loop bodies that access more than just one array. Increased self- and/or cross-interference misses can be tolerated through prefetching. Such larger tiles also reduce mispredicted branches and, as a result, the lost CPU cycles that arise. Results are validated through actual benchmarks on an SMT platform.
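As a rough illustration of a tile that "completely fits in the L1 cache", the sketch below computes the largest square tile of one array whose elements fit in a cache of a given size. This is a capacity-only estimate under assumed parameters (8-byte elements, a hypothetical helper `largest_square_tile`); it ignores associativity and interference and is not the analysis from the paper.

```python
import math


def largest_square_tile(cache_bytes, elem_bytes=8):
    """Largest T such that a T x T tile of elem_bytes-sized elements fits."""
    capacity_elems = cache_bytes // elem_bytes
    return math.isqrt(capacity_elems)   # floor of the square root


tile_32k = largest_square_tile(32 * 1024)   # 4096 doubles -> 64 x 64 tile
tile_16k = largest_square_tile(16 * 1024)   # 2048 doubles -> 45 x 45 tile
```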


“…Wolf et al [34] consider the integrated treatment of fusion and tiling only from the point of view of enhancing locality and do not consider the impact of the amount of required memory; the memory requirement is a key issue for the problems considered in this paper. Loop tiling for enhancing data locality has been studied extensively [27,33,30], and analytic models of the impact of tiling on locality have been developed [7,20,25]. Recently, a data-centric version of tiling called data shackling has been developed [12,13] (together with more recent work by Ahmed et al [1]) which allows a cleaner treatment of locality enhancement in imperfectly nested loops.…”

Section: Related Work (mentioning)

confidence: 99%

Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization

Cociorva

Wilkins

Baumgartner

et al. 2001

High Performance Computing — HiPC 2001


Abstract. The goal of our project is the development of a program synthesis system to facilitate the development of high-performance parallel programs for a class of computations encountered in computational chemistry and computational physics. These computations are expressible as a set of tensor contractions and arise in electronic structure calculations. This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures. We focus on an approach to performing data locality optimization in this context. Preliminary experimental results on an SGI Origin 2000 are encouraging and demonstrate that the approach is effective.

