Proceedings of the International Symposium on Memory Management 2011
DOI: 10.1145/1993478.1993486

Cache index-aware memory allocation

Abstract: Poor placement of data blocks in memory may negatively impact application performance because of an increase in the cache conflict miss rate [18]. For dynamically allocated structures this placement is typically determined by the memory allocator. Cache index-oblivious allocators may inadvertently place blocks on a restricted fraction of the available cache indexes, artificially and needlessly increasing the conflict miss rate. While some allocators are less vulnerable to this phenomenon, no general-purpose mal…
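To make the conflict-miss concern concrete, recall how a set-associative cache maps an address to a set: the set index is taken from the address bits just above the cache-line offset. The minimal sketch below is not taken from the paper; the line size, set count, and the 32 KiB allocation stride are hypothetical example parameters. It shows that when an allocator hands out blocks whose addresses differ by a multiple of the cache's way size (sets × line size), every block maps to the same set and the blocks compete for the same few cache lines.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical cache geometry, for illustration only. */
#define LINE_SIZE 64u     /* bytes per cache line */
#define NUM_SETS  512u    /* number of cache sets */

/* Set index = address bits above the line offset, modulo the set count. */
static unsigned cache_set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    /* Blocks returned at a fixed 32 KiB stride (= NUM_SETS * LINE_SIZE)
     * all fall on the same cache set: an index-oblivious placement. */
    for (int i = 0; i < 4; i++) {
        uintptr_t block = 0x100000u + (uintptr_t)i * 32u * 1024u;
        printf("block %d -> set %u\n", i, cache_set_index(block));
    }
    return 0;
}
```

An index-aware allocator, by contrast, deliberately spreads block start addresses so that their set indexes cover a much larger fraction of the cache.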

Citations: cited by 22 publications (30 citation statements)
References: 32 publications (23 reference statements)

Citation statements, ordered by relevance:
“…The results show that GCR avoids the scalability collapse, which translates to substantial speedup (up to three orders of magnitude) in case of high lock contention for virtually every evaluated lock, workload and machine. Furthermore, we show empirically that GCR does not harm the fairness of underlying locks (in fact, in many cases GCR makes the fairness better). [Footnote 1: We also discuss other waiting policies and their limitations later in the paper.]…”
Section: Introduction
confidence: 82%
“…Thus, limiting the maximum number of threads by the number of cores does not help much. Finally, even when a saturated lock delivers a seemingly stable performance, threads spinning and waiting for the lock consume energy and take resources (such as CPU time) from other, unrelated tasks¹.…”
Section: Introduction
confidence: 99%
“…The problem of developing OS-level memory allocators that can perform cache-aware allocations and feature a predictable execution time has been addressed in [Chilimbi et al. 2000; Afek et al. 2011; Herter et al. 2011]. The common denominator for all the aforementioned techniques is to improve the predictability of real-time systems deployed on top of cache-based architectures, in order to provide better isolation guarantees for real-time embedded applications.…”
Section: Existing Solutions and Contributions
confidence: 99%
“…Cache-Index Friendly (CIF) also tries to improve the average execution time, instead of improving predictability, by explicitly controlling the cache-index position of allocated memory blocks [Afek et al. 2011]. The central idea in CIF is to insert small spacer regions into the array of blocks within the allocator to better distribute block indices and disrupt the regular ordering of block addresses returned by the allocator.…”
Section: Cache-aware Allocators
confidence: 99%
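The spacer idea in the quoted passage can be pictured with a short sketch. This is not the allocator's actual code; the cache geometry, block size, superblock base, and the one-cache-line spacer after every 8 blocks are hypothetical illustration parameters. Without spacers, block start offsets advance by a fixed power-of-two stride, so their set indexes cycle through only a fraction of the available sets; skipping a cache line every few blocks shifts later offsets and spreads the indexes over more sets.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical parameters, for illustration only. */
#define LINE_SIZE         64u    /* cache line size in bytes            */
#define NUM_SETS          512u   /* number of cache sets                */
#define BLOCK_SIZE        256u   /* size class carved from a superblock */
#define BLOCKS_PER_SPACER 8      /* skip one cache line every 8 blocks  */

static unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Lay out block start offsets within a superblock, inserting a line-sized
 * spacer after every BLOCKS_PER_SPACER blocks so that block offsets (and
 * hence the cache sets their first lines map to) drift instead of repeating
 * in lockstep. */
static void carve_blocks(uintptr_t superblock_base, size_t count)
{
    uintptr_t offset = 0;
    for (size_t i = 0; i < count; i++) {
        if (i > 0 && i % BLOCKS_PER_SPACER == 0)
            offset += LINE_SIZE;                 /* the spacer */
        printf("block %2zu at offset %5lu -> set %u\n",
               i, (unsigned long)offset,
               set_index(superblock_base + offset));
        offset += BLOCK_SIZE;
    }
}

int main(void)
{
    carve_blocks(0x200000u, 24);   /* arbitrary superblock base address */
    return 0;
}
```

In a real allocator the spacer bytes would simply be left unused or reused for metadata; the point here is only that a small, cheap perturbation of block offsets already breaks the regular pattern of block addresses and indexes.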
“…An additional confounding fact is that under certain loads, the MCS unlock operator may execute futile CAS operations that generate unnecessary coherence traffic, and that the unlock operator may need to busy-wait to allow an arriving successor to update the next pointer in the owner's qnode.² Threads might also malloc and free queue nodes as needed, but most malloc allocators are not sufficiently scalable. Also, many malloc implementations themselves make use of POSIX locks, resulting in reentry and recursion if a lock implementation were to try to call malloc, which in turn would need to acquire a lock.…”
Section: Introduction
confidence: 99%
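For readers who have not seen MCS in detail, the behaviour described in the quote is visible in the textbook algorithm itself. The sketch below is standard MCS rendered with C11 atomics, not code from the cited paper; the comments mark the compare-and-swap in unlock that fails (yet still costs a coherence transaction) when a successor is just arriving, and the busy-wait for that successor to publish its next pointer.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Textbook MCS queue lock, sketched with C11 atomics. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool                locked;
} mcs_node_t;

typedef _Atomic(mcs_node_t *) mcs_lock_t;   /* tail of the waiter queue */

void mcs_lock(mcs_lock_t *lock, mcs_node_t *me)
{
    atomic_store(&me->next, (struct mcs_node *)NULL);
    atomic_store(&me->locked, true);
    mcs_node_t *pred = atomic_exchange(lock, me);
    if (pred != NULL) {
        atomic_store(&pred->next, me);       /* publish ourselves to the predecessor */
        while (atomic_load(&me->locked))
            ;                                /* spin until the lock is handed over   */
    }
}

void mcs_unlock(mcs_lock_t *lock, mcs_node_t *me)
{
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to empty. If another
         * thread is mid-arrival, this CAS fails (a futile CAS that still
         * generates coherence traffic) and we must busy-wait until the
         * successor sets our next pointer. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(lock, &expected, (mcs_node_t *)NULL))
            return;
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                /* wait for the successor to appear */
    }
    atomic_store(&succ->locked, false);      /* pass ownership to the successor */
}
```

The qnodes here are caller-supplied, typically thread-local or stack-allocated, which is exactly why the quoted passage is wary of having a lock implementation fall back on malloc/free for them.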