2012
DOI: 10.1145/2370036.2145849
Revisiting the combining synchronization technique

Abstract: Fine-grain thread synchronization has, in several cases, been shown to be outperformed by efficient implementations of the combining technique, in which a single thread, called the combiner, holds a coarse-grain lock and serves, in addition to its own synchronization request, the active requests announced by other threads while they wait by performing some form of spinning. Efficient implementations of this technique significantly reduce the cost of synchronization, so in many cases the…
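To make the abstract's description concrete, the following C fragment is a minimal sketch of the combining idea applied to a shared counter. It is an illustration only, not the paper's CC-Synch algorithm: the names announce, combiner_lock, NTHREADS, and fetch_and_add are placeholders, and the combiner is elected here with a simple test-and-set lock.

```c
/* Minimal sketch of the combining technique described in the abstract,
 * shown for a shared counter. Illustrative only -- not the paper's
 * CC-Synch algorithm; all names are placeholders. */
#include <stdatomic.h>
#include <stdbool.h>

#define NTHREADS 64

typedef struct {
    atomic_bool pending;   /* request announced but not yet served */
    int arg;               /* operand of the announced operation   */
    int result;            /* filled in by the combiner            */
} request_t;

static request_t   announce[NTHREADS];   /* one announcement slot per thread */
static atomic_bool combiner_lock;        /* coarse-grain lock (false = free) */
static int         shared_counter;       /* protected by combiner_lock       */

int fetch_and_add(int tid, int val)
{
    /* 1. Announce the request in this thread's slot. */
    announce[tid].arg = val;
    atomic_store(&announce[tid].pending, true);

    for (;;) {
        /* 2. Try to become the combiner by taking the coarse-grain lock. */
        if (!atomic_exchange(&combiner_lock, true)) {
            /* Combiner: serve every announced request, ours included. */
            for (int i = 0; i < NTHREADS; i++) {
                if (atomic_load(&announce[i].pending)) {
                    announce[i].result = shared_counter;
                    shared_counter   += announce[i].arg;
                    atomic_store(&announce[i].pending, false);
                }
            }
            atomic_store(&combiner_lock, false);
            return announce[tid].result;
        }
        /* 3. Otherwise spin until some combiner has served our request. */
        if (!atomic_load(&announce[tid].pending))
            return announce[tid].result;
    }
}
```

Threads that lose the race for the lock spin only on their own announcement slot; whichever thread holds the lock executes the whole batch of announced requests in one pass, which is where the reduction in synchronization cost described in the abstract comes from.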

Cited by 71 publications (123 citation statements)
References 19 publications
“…Besides the optimized server-based solutions that implement Algorithm 1, we also evaluate CC-Synch [5], as a representative of combining approaches, as well as H-Synch, its NUMA-aware version. H-Synch follows the general idea of grouping operations originating from the same node and executing them together in batches, thus incurring fewer cross-socket cache line transfers and significantly increasing throughput.…”
Section: Methods
confidence: 99%
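The per-node batching that this citing work attributes to H-Synch can be sketched as follows. This is only an illustration of grouping same-node requests under a global lock, not the H-Synch algorithm itself; NNODES, THREADS_PER_NODE, node_lock, serve_node_batch, and numa_fetch_and_add are assumed names.

```c
/* Sketch of NUMA-aware batching: one combiner per node serves the
 * requests announced on that node while holding a global lock, so
 * same-node operations execute back to back. Illustrative only. */
#include <stdatomic.h>
#include <stdbool.h>

#define NNODES            4
#define THREADS_PER_NODE 16

typedef struct {
    atomic_bool pending;
    int arg, result;
} request_t;

typedef struct {
    request_t   announce[THREADS_PER_NODE]; /* node-local announcements    */
    atomic_bool node_lock;                   /* elects this node's combiner */
} node_t;

/* Statically zero-initialized: all locks free, no requests pending. */
static node_t      nodes[NNODES];
static atomic_bool global_lock;              /* protects shared_counter     */
static int         shared_counter;

/* Serve every request currently announced on one node (the batch). */
static void serve_node_batch(node_t *nd)
{
    for (int i = 0; i < THREADS_PER_NODE; i++) {
        if (atomic_load(&nd->announce[i].pending)) {
            nd->announce[i].result = shared_counter;
            shared_counter        += nd->announce[i].arg;
            atomic_store(&nd->announce[i].pending, false);
        }
    }
}

int numa_fetch_and_add(int node, int slot, int val)
{
    node_t *nd = &nodes[node];

    nd->announce[slot].arg = val;
    atomic_store(&nd->announce[slot].pending, true);

    for (;;) {
        /* Only one thread per node ever competes for the global lock. */
        if (!atomic_exchange(&nd->node_lock, true)) {
            while (atomic_exchange(&global_lock, true))
                ;                         /* spin: wait for other nodes  */
            serve_node_batch(nd);         /* execute this node's batch   */
            atomic_store(&global_lock, false);
            atomic_store(&nd->node_lock, false);
            return nd->announce[slot].result;
        }
        if (!atomic_load(&nd->announce[slot].pending))
            return nd->announce[slot].result;
    }
}
```

Because only one thread per node competes for the global lock and each acquisition serves an entire node-local batch, the shared state crosses the socket boundary once per batch rather than once per operation, which is the reduction in cross-socket cache line transfers described in the quotation.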
“…This turned out to result in unfavorable interference in our experiments, which we avoid by skipping every second cache line when allocating client slots. In experiments where memory management is needed (stacks and queues), cache-aligned memory chunks are allocated and deallocated using per-thread pools (we use the implementation provided by the authors of CC-Synch [5]). …”
Section: Methods
confidence: 99%
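The slot layout mentioned in this quotation, leaving every second cache line unused between client slots, can be sketched as below. CACHE_LINE, MAX_CLIENTS, and client_slot are assumed names, and the 64-byte line size is an assumption, not taken from the cited code.

```c
/* Sketch of client slots spaced two cache lines apart, so that
 * adjacent-cache-line prefetching triggered by one slot does not
 * interfere with its neighbour. Constants and names are illustrative. */
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE   64          /* assumed cache line size in bytes */
#define MAX_CLIENTS  128

typedef struct {
    alignas(CACHE_LINE) volatile uint64_t request;   /* announced operation */
    char pad[2 * CACHE_LINE - sizeof(uint64_t)];     /* skip the next line  */
} client_slot_t;                 /* sizeof(client_slot_t) == 2 * CACHE_LINE */

static client_slot_t slots[MAX_CLIENTS];

static inline volatile uint64_t *client_slot(int client_id)
{
    return &slots[client_id].request;
}
```

The per-thread memory pools mentioned in the same sentence address the analogous problem at the allocator level: each thread recycles cache-aligned chunks from its own pool, so allocating and freeing nodes for the stacks and queues never touches another thread's cache lines.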