Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies

Jaleel, Aamer; Borch, Eric; Bhandaru, Malini; Steely, Simon C.; Emer, Joel

doi:10.1109/micro.2010.52

Cited by 101 publications

(72 citation statements)

References 14 publications

“…Thus, when the shared cache evicts a block with non-empty tracking bits, it is required to send a recall message to each private cache that is caching the block, adding to system traffic. More insidiously, such recalls can increase the cache miss rate by forcing cores to evict hot blocks they are actively using [11]. To ensure scalability, we seek a system that makes recalls vanishingly rare, the design of which first requires understanding the reasons why recalls occur.…”

Section: Concern #3: Maintaining Inclusion (mentioning)

confidence: 99%

Why on-chip cache coherence is here to stay

2012

Today's multicore chips commonly implement shared memory with cache coherence as low-level support for operating systems and application software. Technology trends continue to enable the scaling of the number of (processor) cores per chip. Because conventional wisdom says that the coherence does not scale well to many cores, some prognosticators predict the end of coherence. This paper refutes this conventional wisdom by showing one way to scale on-chip cache coherence with bounded costs by combining known techniques such as: shared caches augmented to track cached copies, explicit cache eviction notifications, and hierarchical design. Based upon our scalability analysis of this proof-of-concept design, we predict that on-chip coherence and the programming convenience and compatibility it provides are here to stay.

“…Many other papers have also looked at exclusive [Barosso et al 2000] and noninclusive [Jaleel et al 2010] cache hierarchies; however, they focus on server-class and general-purpose processors with considerably different workloads than this particular study.…”

Section: Memory Coherence and Consistence (mentioning)

confidence: 99%

Virtual Ways: Low-Cost Coherence for Instruction Set Extensions with Architecturally Visible Storage

Kluter, Burri, Brisk, et al.

2014

ACM Trans. Archit. Code Optim.

Instruction set extensions (ISEs) improve the performance and energy consumption of application-specific processors. ISEs can use architecturally visible storage (AVS), localized compiler-controlled memories, to provide higher I/O bandwidth than reading data from the processor pipeline. AVS creates coherence and consistence problems with the data cache. Although a hardware coherence protocol could solve the problem, this approach is costly for a single-processor system. As a low-cost alternative, we introduce Virtual Ways, which ensures coherence through a reduced form of inclusion between the data cache and AVS. Virtual Ways achieve higher performance and lower energy consumption than using a hardware coherence protocol.

“…In an inclusive cache hierarchy, as shown in Figure 1(c), cache blocks stored in the L1 cache should also be stored in the L2 cache. When a block is evicted from the L2 cache, the corresponding block in the L1 cache (if present) has to be invalidated to maintain inclusion (referred to as back-invalidation [Jaleel et al 2010]). Thus, the capacity of the whole inclusive cache hierarchy equals the capacity of its LLC (the L2 cache in this example).…”

Section: Motivation (mentioning)

confidence: 99%
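
The back-invalidation behavior described in the statement above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the mechanism of any cited paper: both levels are modeled as fully associative LRU caches over block addresses, and the names `LRUCache`, `InclusiveHierarchy`, and `back_invalidations` are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache of block addresses (a simplification)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # least recently used entry comes first

    def touch(self, addr):
        """Return True on a hit and mark the block most recently used."""
        if addr in self.blocks:
            self.blocks.move_to_end(addr)
            return True
        return False

    def insert(self, addr):
        """Insert a block; return the evicted victim address, or None."""
        victim = None
        if len(self.blocks) >= self.capacity:
            victim, _ = self.blocks.popitem(last=False)
        self.blocks[addr] = True
        return victim

    def invalidate(self, addr):
        """Remove a block if present; return True if it was present."""
        return self.blocks.pop(addr, None) is not None


class InclusiveHierarchy:
    """Two-level hierarchy in which every L1 block must also be in L2."""

    def __init__(self, l1_size, l2_size):
        self.l1 = LRUCache(l1_size)
        self.l2 = LRUCache(l2_size)
        self.back_invalidations = 0

    def access(self, addr):
        if self.l1.touch(addr):
            return "L1 hit"  # L2 never observes this reference
        if self.l2.touch(addr):
            result = "L2 hit"
        else:
            result = "miss"
            victim = self.l2.insert(addr)
            # Inclusion: a block evicted from L2 must also leave L1.
            if victim is not None and self.l1.invalidate(victim):
                self.back_invalidations += 1
        self.l1.insert(addr)  # an L1 victim is simply dropped
        return result


h = InclusiveHierarchy(l1_size=2, l2_size=3)
trace = ["A", "B", "A", "C", "A", "D", "A"]
results = [h.access(a) for a in trace]
print(results, h.back_invalidations)
```

In this trace, block A keeps hitting in L1, so its L2 recency is never refreshed. When D is brought in, A is the L2 victim and is back-invalidated from L1, and the final access to A misses in both levels.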

“…Previous work [Jaleel et al 2010] identified blocks that have high temporal locality in higher-level caches and reduced the frequency of back-invalidating them, which makes performance of inclusive cache hierarchies achieve that of noninclusive caches. However, blocks that have poor temporal locality in higher-level caches may still have temporal locality in the LLC, and the replacement of these blocks will still hurt the overall performance.…”

Section: Motivation (mentioning)

confidence: 99%

Temporal-based multilevel correlating inclusive cache replacement

Tian, Khan, Jiménez

2013

TACO

Inclusive caches have been widely used in Chip Multiprocessors (CMPs) to simplify cache coherence. However, they have poor performance compared with noninclusive caches not only because of the limited capacity of the entire cache hierarchy but also due to ignorance of temporal locality of the Last-Level Cache (LLC). Blocks that are highly referenced (referred to as hot blocks) are always hit in higher-level caches (e.g., L1 cache) and are rarely referenced in the LLC. Therefore, they become replacement victims in the LLC. Due to the inclusion property, blocks evicted from the LLC have to also be invalidated from higher-level caches. Invalidation of hot blocks from the entire cache hierarchy introduces costly off-chip misses that makes the inclusive cache perform poorly. Neither blocks that are highly referenced in the LLC nor blocks that are highly referenced in higher-level caches should be the LLC replacement victims. We propose temporal-based multilevel correlating cache replacement for inclusive caches to evict blocks in the LLC that are also not hot in higher-level caches using correlated temporal information acquired from all levels of a cache hierarchy with minimal overhead. Invalidation of these blocks does not hurt the performance. By contrast, replacing them as early as possible with useful blocks helps improve cache performance. Based on our experiments, in a dual-core CMP, an inclusive cache with temporal-based multilevel correlating cache replacement significantly outperforms an inclusive cache with traditional LRU replacement by yielding an average speedup of 12.7%, which is comparable to an enhanced noninclusive cache, while requiring less than 1% of storage overhead.
