NUMA-aware reader-writer locks

Calciu, Irina; Dice, Dave; Lev, Yossi; Luchangco, Victor; Marathe, Virendra J.; Shavit, Nir

doi:10.1145/2442516.2442532

Cited by 67 publications

(31 citation statements)

References 22 publications


“…Even our optimized NO WAIT implementation does not scale as well as SILO due to contention caused by atomic instructions used in the read-write lock implementation (Figure 2). Designing scalable, NUMA-aware read-write lock is a topic of intense research in the concurrent programming community [6,10,16]. Using such locks to further minimize the impact of physical synchronization in both pessimistic and optimistic protocols is a promising direction of future research.…”

Section: Implications Of Our Analysis (mentioning)

confidence: 99%

Analyzing the impact of system architecture on the scalability of OLTP engines for high-contention workloads

et al. 2017


Main-memory OLTP engines are being increasingly deployed on multicore servers that provide abundant thread-level parallelism. However, recent research has shown that even the state-of-the-art OLTP engines are unable to exploit available parallelism for high contention workloads. While previous studies have shown the lack of scalability of all popular concurrency control protocols, they consider only one system architecture: a non-partitioned, shared-everything one where transactions can be scheduled to run on any core and can access any data or metadata stored in shared memory. In this paper, we perform a thorough analysis of the impact of other architectural alternatives (Data-oriented transaction execution, Partitioned Serial Execution, and Delegation) on scalability under high contention scenarios. In doing so, we present Trireme, a main-memory OLTP engine testbed that implements four system architectures and several popular concurrency control protocols in a single code base. Using Trireme, we present an extensive experimental study to understand i) the impact of each system architecture on overall scalability, ii) the interaction between system architecture and concurrency control protocols, and iii) the pros and cons of new architectures that have been proposed recently to explicitly deal with high-contention workloads.


“…For example, our pipeline runs 8 instances (two per socket) of the CNN code for membrane detection, where each instance uses 9 cores. This enabled efficient use of the caches on each socket and eliminated the need to handle complex NUMA overheads [5,13,14,16,47].…”

Section: Scalable Software Saves Memory (mentioning)

confidence: 99%

A Multicore Path to Connectomics-on-Demand

Shavit 2016

Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures

Self Cite


The current design trend in large scale machine learning is to use distributed clusters of CPUs and GPUs with MapReduce-style programming. Some have been led to believe that this type of horizontal scaling can reduce or even eliminate the need for traditional algorithm development, careful parallelization, and performance engineering. This paper is a case study showing the contrary: that the benefits of algorithms, parallelization, and performance engineering, can sometimes be so vast that it is possible to solve "cluster-scale" problems on a single commodity multicore machine. Connectomics is an emerging area of neurobiology that uses cutting edge machine learning and image processing to extract brain connectivity graphs from electron microscopy images. It has long been assumed that the processing of connectomics data will require mass storage, farms of CPU/GPUs, and will take months (if not years) of processing time. We present a high-throughput connectomics-on-demand system that runs on a multicore machine with less than 100 cores and extracts connectomes at the terabyte per hour pace of modern electron microscopes.


“…Scalable synchronization structures typically rely on efficient inter-core communication using atomic operations. Since an atomic operation becomes much slower over inter-socket links, proposals for scalable NUMA-aware locks rely on hierarchically partitioned structures to maximize access locality [9][10]. On the system level, a recent study on the performance of garbage collectors on multisocket multicores analyzes synchronization patterns and systematically removes bottlenecks without completely redesigning the system [11].…”

Section: A Multisocket Multicores (mentioning)

confidence: 99%

ATraPos: Adaptive transaction processing on hardware Islands

Porobic, Liarou, Tözün, et al. 2014

2014 IEEE 30th International Conference on Data Engineering


Nowadays, high-performance transaction processing applications increasingly run on multisocket multicore servers. Such architectures exhibit non-uniform memory access latency as well as non-uniform thread communication costs. Unfortunately, traditional shared-everything database management systems are designed for uniform inter-core communication speeds. This causes unpredictable access latencies in the critical path. While lack of data locality may be a minor nuisance on systems with fewer than 4 processors, it becomes a serious scalability limitation on larger systems due to accesses to centralized data structures. In this paper, we propose ATraPos, a storage manager design that is aware of the non-uniform access latencies of multisocket systems. ATraPos achieves good data locality by carefully partitioning the data as well as internal data structures (e.g., state information) to the available processors and by assigning threads to specific partitions. Furthermore, ATraPos dynamically adapts to the workload characteristics, i.e., when the workload changes, ATraPos detects the change and automatically revises the data partitioning and thread placement to fit the current access patterns and hardware topology. We prototype ATraPos on top of an open-source storage manager Shore-MT and we present a detailed experimental analysis with both synthetic and standard (TPC-C and TATP) benchmarks. We show that ATraPos exhibits performance improvements of a factor ranging from 1.4 to 6.7x for a wide collection of transactional workloads. In addition, we show that the adaptive monitoring and partitioning scheme of ATraPos poses a negligible cost, while it allows the system to dynamically and gracefully adapt when the workload changes.
