Strategies for cache and local memory management by global program transformation

Gannon, Dennis; Jalby, William; Gallivan, Kyle A.

doi:10.1007/3-540-18991-2_14

Cited by 56 publications

(67 citation statements)

References 10 publications


“…Wolf and Lam provide a concise definition and summary of important types of data locality [33]. Computation-reordering transformations such as loop permutation and tiling are the primary optimization techniques [9,21,33], though loop fission (distribution) and loop fusion have also been found to be helpful [21].…”

Section: Related Work (mentioning)

confidence: 99%


Tiling Optimizations for 3D Scientific Computations

Rivera

Tseng

2000

ACM/IEEE SC 2000 Conference (SC'00)

139

133


Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
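As a rough illustration of the tiling transformation the abstract describes, the sketch below applies a six-neighbor averaging sweep to a 3D grid twice: once with plain loops and once with the two inner dimensions blocked into tiles. It is a minimal sketch, not the tile-selection or padding algorithm of the cited paper. The function names, the nested-list grid layout, and the tile sizes `tj` and `ti` are all assumptions made for illustration.

```python
def make_grid(n, fill=0.0):
    """Allocate an n x n x n grid as nested lists (illustrative layout)."""
    return [[[fill] * n for _ in range(n)] for _ in range(n)]


def average_neighbors(a, k, j, i):
    """Average of the six face neighbors of point (k, j, i)."""
    return (a[k - 1][j][i] + a[k + 1][j][i] +
            a[k][j - 1][i] + a[k][j + 1][i] +
            a[k][j][i - 1] + a[k][j][i + 1]) / 6.0


def stencil_naive(a, n):
    """Untiled sweep over the interior points."""
    b = make_grid(n)
    for k in range(1, n - 1):
        for j in range(1, n - 1):
            for i in range(1, n - 1):
                b[k][j][i] = average_neighbors(a, k, j, i)
    return b


def stencil_tiled(a, n, tj=4, ti=4):
    """Same sweep with the j and i loops blocked into tj x ti tiles.

    For each tile, the k loop walks through all planes, so the working
    set per plane is a small tile rather than a full n x n plane.
    The tile sizes are hypothetical; choosing good ones is the hard part.
    """
    b = make_grid(n)
    for jj in range(1, n - 1, tj):          # tile origin in j
        for ii in range(1, n - 1, ti):      # tile origin in i
            for k in range(1, n - 1):
                for j in range(jj, min(jj + tj, n - 1)):
                    for i in range(ii, min(ii + ti, n - 1)):
                        b[k][j][i] = average_neighbors(a, k, j, i)
    return b
```

Both functions compute the same values at every point; only the order of the iterations changes, which is what makes tiling a pure reordering transformation. In pure Python the cache effect itself will not be visible, so the sketch shows the loop structure only.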


“…Several cache capacity estimation techniques have been proposed to help guide data locality optimizations [9,33]. These techniques can also be enhanced to take into account limited cache associativity [8,30].…”

Section: Related Work (mentioning)

confidence: 99%

Tiling Optimizations for 3D Scientific Computations

Rivera

Tseng

2000

ACM/IEEE SC 2000 Conference (SC'00)

139

133


“…Given that the number of octrees is larger than the number of cores by several orders of magnitude, one important aspect of our strategy is that each octree is assigned to a single thread and is thus processed sequentially. This allows us to optimize the octree traversal with respect to cache usage [14,15]: leaf cells are explored along a Z-order curve to minimize cache misses, as illustrated in Figure 3 (red path).…”

Section: Software Architecture (mentioning)

confidence: 99%

Combining Task-based Parallelism and Adaptive Mesh Refinement Techniques in Molecular Dynamics Simulations

Prat

Colombet

Namyst

2018

Proceedings of the 47th International Conference on Parallel Processing


Modern parallel architectures require applications to generate massive parallelism so as to feed their large number of cores and their wide vector units. We revisit the extensively studied classical Molecular Dynamics N-body problem in the light of these hardware constraints. We use Adaptive Mesh Refinement techniques to store particles in memory, and to optimize the force computation loop using multi-threading and vectorization-friendly data structures. Our design is guided by the need for load balancing and adaptivity raised by highly dynamic particle sets, as typically observed in simulations of strong shocks resulting in material micro-jetting. We analyze performance results on several simulation scenarios, over nodes equipped by Intel Xeon Phi Knights Landing (KNL) or Intel Xeon Skylake (SKL) processors. Performance obtained with our OpenMP implementation outperforms state-of-the-art implementations (LAMMPS) on both steady and micro-jetting particles simulations. In the latter case, our implementation is 4.7 times faster on KNL, and 2 times faster on SKL.
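The citing statement above mentions exploring leaf cells along a Z-order curve to reduce cache misses. One common way to obtain such an ordering is to interleave the bits of the cell coordinates into a Morton key and sort by it. The sketch below shows that idea only; it is not taken from the cited implementation, and the function names and the `bits` parameter are assumptions.

```python
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton key.

    Bit b of x lands at position 3*b, of y at 3*b + 1, of z at 3*b + 2.
    """
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key


def z_order(cells):
    """Return the (x, y, z) cell coordinates sorted along the Z-order curve."""
    return sorted(cells, key=lambda c: morton3(*c))
```

Cells that are close on the curve tend to be close in space, so visiting them in this order tends to touch nearby memory when the cells are stored in the same order.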


“…In order to decide which memory lines to load, we compute, for each variable, the range of addresses that it accesses. When analyzing array variables, we use the concept of uniformly generated references (UGR) [Gannon et al 1988] to decide which part of the array is accessed within the region. Two references are called uniformly generated when their array subscripts are affine and differ at most in their constant terms [Gannon et al 1988].…”

Section: Selecting Data To Lock In the Cache (Loaddata) (mentioning)

confidence: 99%

“…When analyzing array variables, we use the concept of uniformly generated references (UGR) [Gannon et al 1988] to decide which part of the array is accessed within the region. Two references are called uniformly generated when their array subscripts are affine and differ at most in their constant terms [Gannon et al 1988]. At line 3 we classify all memory references to the studied variable V into uniformly generated classes.…”

Section: Selecting Data To Lock In the Cache (Loaddata) (mentioning)

confidence: 99%

Data cache locking for tight timing calculations

Vera

Lisper

Xue

2007

ACM Trans. Embed. Comput. Syst.


Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality. In addition, caches are a source of unpredictability, resulting in programs sometimes behaving in a different way than expected. Detailed information about the number of cache misses and their causes allows us to predict cache behavior and to detect bottlenecks. Small modifications in the source code may change memory patterns, thereby altering the cache behavior. Code transformations which take the cache behavior into account might result in a high cache performance improvement. However, cache memory behavior is very hard to predict, thus making the task of optimizing and timing cache behavior very difficult. This article proposes and evaluates a new compiler framework that times cache behavior for multitasking systems. Our method explores the use of cache partitioning and dynamic cache locking to provide worst-case performance estimates in a safe and tight way for multitasking systems. We use cache partitioning, which divides the cache among tasks to eliminate inter-task cache interferences. We combine static cache analysis and cache locking mechanisms to ensure that all intra-task conflicts, and consequently, memory access times, are exactly predictable. The results of our experiments demonstrate the capability of our framework to describe cache behavior at compile time. We compare our timing approach with a system equipped with a nonpartitioned but statically locked data cache. Our method outperforms static cache locking for all analyzed task sets under various cache architectures, demonstrating that our fully predictable scheme does not compromise the performance of the transformed programs.
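The citing statements above define two references as uniformly generated when their subscripts are affine and differ at most in their constant terms. A minimal sketch of grouping references into such classes follows. The representation is an assumption made for illustration: each reference is a tuple of array name, per-subscript coefficient vectors over the loop variables, and per-subscript constants. Requiring the same array name is also an assumption of this sketch, and `classify_ugr` is a hypothetical helper, not a function from the cited work.

```python
from collections import defaultdict


def classify_ugr(refs):
    """Group affine array references into uniformly generated classes.

    Each reference is (array, coeffs, consts), where coeffs holds one
    coefficient tuple per subscript and consts one constant per subscript.
    References land in the same class when the array and all coefficients
    match, so they can differ only in their constant terms.
    """
    classes = defaultdict(list)
    for array, coeffs, consts in refs:
        classes[(array, coeffs)].append(consts)
    return dict(classes)


# Example over loop variables (i, j); all names are illustrative.
IDENT = ((1, 0), (0, 1))      # subscripts [i][j]
SWAP = ((0, 1), (1, 0))       # subscripts [j][i]
example_refs = [
    ("A", IDENT, (0, 0)),     # A[i][j]
    ("A", IDENT, (1, -1)),    # A[i+1][j-1], same class as A[i][j]
    ("A", SWAP, (0, 0)),      # A[j][i], different coefficients
    ("B", IDENT, (0, 0)),     # B[i][j], different array
]
example_classes = classify_ugr(example_refs)
```

In this example `A[i][j]` and `A[i+1][j-1]` share a class, while `A[j][i]` and `B[i][j]` each form their own.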
