IN AN IDEAL WORLD, applications are expected to scale automatically when executed on increasingly larger systems. In practice, however, not only does this scaling fail to occur, but it is common to see performance actually worsen on those larger-scale systems. While performance and scalability can be ambiguous terms, they become less so when problems present themselves at the lower end of the software stack, simply because the number of factors to consider when evaluating a performance problem decreases. As such, concurrent multithreaded programs such as operating-system kernels, hypervisors, and database engines can pay a high price for misusing hardware resources, and this translates into performance issues for applications executing higher up in the stack. One clear example is the design and implementation of synchronization primitives (locks) for shared-memory systems. Locks are a way of allowing multiple threads to execute concurrently, providing safe and correct execution contexts through mutual exclusion. To achieve serialization, locks typically require hardware support through the use of atomic operations such as compare-and-swap (CAS), fetch-and-add, and atomic arithmetic instructions. While details vary across cache-coherent architectures, atomic operations broadcast changes across the memory bus, updating the value of the shared variable for every core and forcing cache-line invalidations and, therefore, more cache-line misses. Software engineers often abuse these primitives, leading to significant performance degradation caused by poor lock granularity or high latency. Both the correctness and the performance of locks depend on the underlying hardware architecture. That is why scalability and hardware implications are so important in the design of locking algorithms.
Unfortunately, these are rare considerations in real-world software. With the advent of increasingly larger multi- and many-core NUMA (nonuniform memory access) systems, the performance penalties of poor locking implementations become painfully evident. These penalties apply to the primitive's actual implementation, as well as to its usage, the latter of which many developers directly control by designing locking schemes for data serialization. After decades of research this is a well-known fact, and it has never been truer than today. Despite recent technologies such as lock elision and transactional memory, however, concurrency, parallel programming, and synchronization remain challenging topics for practitioners.10 Furthermore, because a transactional memory system such as Trans-