DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism

Choi, Byn; Komuravelli, Rakesh; Sung, Hyojin; Smolinski, Robert; Honarmand, Nima; Adve, Sarita V.; Adve, Vikram; Carter, Nicholas P.; Chou, Ching-Tsun

doi:10.1109/pact.2011.21

Cited by 150 publications

(150 citation statements)

References 53 publications

Supporting

Mentioning: 147

Contrasting


“…Several cache-coherence optimizations reduce the cost of updates, though that is not their primary purpose: self-invalidations, done with either hardware predictors [43] or software protocols [16,33], remove invalidations from the critical path; adaptive-granularity coherence schemes [38,67,71] reduce both false sharing and the amount of dirty data sent on invalidations; and speculation and fast networks can reduce the cost of atomic operations [27]. These schemes are orthogonal to Coup, which could be used in conjunction with them to improve performance.…”

Section: Additional Related Work (mentioning)

confidence: 99%

Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems

Zhang, Horn, Sánchez. 2015. Proceedings of the 48th International Symposium on Microarchitecture.


We present Coup, a technique to lower the cost of updates to shared data in cache-coherent systems. Coup exploits the insight that many update operations, such as additions and bitwise logical operations, are commutative: they produce the same final result regardless of the order they are performed in. Coup allows multiple private caches to simultaneously hold update-only permission to the same cache line. Caches with update-only permission can locally buffer and coalesce updates to the line, but cannot satisfy read requests. Upon a read request, Coup reduces the partial updates buffered in private caches to produce the final value. Coup integrates seamlessly into existing coherence protocols, requires inexpensive hardware, and does not affect the memory consistency model. We apply Coup to speed up single-word updates to shared data. On a simulated 128-core, 8-socket system, Coup accelerates state-of-the-art implementations of update-heavy algorithms by up to 2.4×.
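The buffering-and-reduction idea in this abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware protocol from the paper: the class `CommutativeLine` and its methods `add` and `read` are hypothetical names, the only commutative operation modeled is integer addition, and a Python list stands in for the per-cache buffers.

```python
class CommutativeLine:
    """Toy model (hypothetical) of one shared word updated by commutative additions."""

    def __init__(self, initial=0, num_caches=4):
        self.value = initial              # value held at the shared level
        self.partial = [0] * num_caches   # one buffered partial update per private cache

    def add(self, cache_id, delta):
        # Update-only permission: coalesce the update locally.
        # No other cache is contacted and no read can be served from here.
        self.partial[cache_id] += delta

    def read(self):
        # A read request triggers a reduction of all buffered partial updates.
        self.value += sum(self.partial)
        self.partial = [0] * len(self.partial)
        return self.value


line = CommutativeLine(initial=10, num_caches=3)
line.add(0, 1)
line.add(2, 5)
line.add(0, 2)
total = line.read()   # 10 + (1 + 2) + 5
```

Because addition is commutative, applying the same updates in any other order, or from other caches, yields the same value at the next read.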


“…SC-for-DRF protocols rely on the guarantee that, during DRF regions, threads perform either private or read-only memory accesses [1], [2], [20]. A memory access is private if it targets a memory location that is only accessed by one thread during the execution of one DRF region; and is read-only if the location is not written within the DRF region.…”

Section: A Sequential Consistency for DRF Protocols (mentioning)

confidence: 99%
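The private and read-only definitions quoted above can be made concrete with a small classifier. This is a minimal sketch, not code from the cited paper: the function `classify_region`, the access-tuple layout, and the label `"shared-written"` are all assumptions introduced for illustration.

```python
def classify_region(accesses):
    """Classify each location touched in one DRF region.

    accesses: iterable of (thread_id, address, is_write) tuples (hypothetical format).
    """
    threads_by_addr = {}
    written = set()
    for thread_id, address, is_write in accesses:
        threads_by_addr.setdefault(address, set()).add(thread_id)
        if is_write:
            written.add(address)

    labels = {}
    for address, threads in threads_by_addr.items():
        if len(threads) == 1:
            # Only one thread touches the location in this region.
            # (A location may also be unwritten; "private" is reported first.)
            labels[address] = "private"
        elif address not in written:
            # Several threads touch it, but none writes it in this region.
            labels[address] = "read-only"
        else:
            # Neither private nor read-only: falls outside the quoted guarantee.
            labels[address] = "shared-written"
    return labels


labels = classify_region([
    (0, "a", True), (0, "a", False),   # one thread only
    (0, "b", False), (1, "b", False),  # two readers, no writer
    (0, "c", True), (1, "c", False),   # written by one thread, read by another
])
```

Under the guarantee described in the quote, a correct DRF region would contain no location in the third category.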

“…This excessive invalidation limits their performance [1], [2]. In contrast, SPEL reduces self-invalidation, by relying on the compiler to indicate the points of synchronization that indeed require self-invalidating cached data.…”

Section: A Sequential Consistency for DRF Protocols (mentioning)

confidence: 99%

“…However, not only traditional protocols perform sub-optimally on modern architectures, but their inefficiency escalates as the number of cores in the system grows. State of the art coherence protocols seek to detect and exploit memory accessing characteristics of code with the goal of simplifying the protocol while delivering scalability and performance [1], [2]. Despite the promising results, such protocols exhibit limitations (for instance, fail at providing support for legacy code), and were therefore disregarded for being integrated in the emerging architectures (e.g., Intel Xeon Phi [3]), which still implement traditional, inefficient, directory-based cache coherence protocols.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, it comes at the cost of performance limitations, especially when the system provides a more relaxed consistency model [4]. In answer, modern coherence protocols follow the sequential consistency for data-race-free (SC for DRF) model [5], which allows a simpler and more scalable design [1], [2] and improves performance [6]. Nevertheless, a major drawback is that such protocols do not provide backwards compatibility with existing software that requires a stronger consistency model.…”

Section: Introduction (mentioning)

confidence: 99%


A Dual-Consistency Cache Coherence Protocol

Ros, Jimborean. 2015. 2015 IEEE International Parallel and Distributed Processing Symposium.


Weak memory consistency models can maximize system performance by enabling hardware and compiler optimizations, but increase programming complexity since they do not match programmers' intuition. The design of an efficient system with an intuitive memory model is an open challenge. This paper proposes SPEL, a dual-consistency cache coherence protocol which simultaneously guarantees the strongest memory consistency model provided by the hardware and yields improvements in both performance and energy consumption. The design of the protocol exploits a compile-time identification of code regions which can be executed under a less restrictive, thus optimized protocol, without harming correctness. Outside these regions, code is executed under a more restrictive protocol which enforces sequential consistency. Compared to a standard directory protocol, we show improvements in performance of 24% and reductions in energy consumption of 32%, on average, for a 64-core chip multiprocessor.


The Road for 2D Semiconductors in the Silicon Age

2022


Continued reduction in transistor size can improve the performance of silicon integrated circuits (ICs). However, as Moore's law approaches physical limits, high‐performance growth in silicon ICs becomes unsustainable, due to challenges of scaling, energy efficiency, and memory limitations. The ultrathin layers, diverse band structures, unique electronic properties, and silicon‐compatible processes of 2D materials create the potential to consistently drive advanced performance in ICs. Here, the potential of fusing 2D materials with silicon ICs to minimize the challenges in silicon ICs, and to create technologies beyond the von Neumann architecture, is presented, and the killer applications for 2D materials in logic and memory devices to ease scaling, energy efficiency bottlenecks, and memory dilemmas encountered in silicon ICs are discussed. The fusion of 2D materials allows the creation of all‐in‐one perception, memory, and computation technologies beyond the von Neumann architecture to enhance system efficiency and remove computing power bottlenecks. Progress on the 2D ICs demonstration is summarized, as well as the technical hurdles it faces in terms of wafer‐scale heterostructure growth, transfer, and compatible integration with silicon ICs. Finally, the promising pathways and obstacles to the technological advances in ICs due to the integration of 2D materials with silicon are presented.

