Parallel program performance often depends critically on barrier performance. On modern NUMA multi-core machines, barrier synchronization performance is significantly affected by cache-coherence communication between cores. In large NUMA systems especially, complex interconnection networks, memory hierarchies, and cache-coherence protocols make barrier algorithms hard to optimize. We propose a general barrier optimization framework for NUMA multi-core machines. The framework splits the barrier into three stages: barrier arrival within a NUMA node, barrier arrival across NUMA nodes, and wakeup. This split provides an opportunity to optimize the communication pattern and the cache-line placement in each stage. To reduce remote communication traffic, we introduce a coordinator per NUMA node. In addition, we implement two barrier algorithms based on the framework. Finally, we show the superiority of the barrier algorithms within our framework over other barrier algorithms, and we show how to translate a barrier algorithm into a performance model that helps in choosing an optimal design tradeoff. Experiments on three NUMA multi-core platforms show that the barrier algorithm optimized within our framework delivers performance as good as or better than state-of-the-art approaches on NUMA multi-core machines.
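The three stages named in the abstract above can be illustrated with a small sketch. This is not the authors' algorithm: the class name `HierarchicalBarrier`, the choice of the last arriver in each node as that node's coordinator, and the use of Python locks and a condition variable in place of cache-line-level spinning are all assumptions made for illustration. The `node` argument is a hypothetical logical node id supplied by the caller; nothing here pins threads or memory to real NUMA nodes.

```python
import threading


class HierarchicalBarrier:
    """Illustrative three-stage barrier: intra-node arrival, inter-node arrival, wakeup."""

    def __init__(self, threads_per_node, num_nodes):
        self.threads_per_node = threads_per_node
        self.num_nodes = num_nodes
        # Stage 1 state: one arrival counter (and lock) per logical node.
        self.node_count = [0] * num_nodes
        self.node_lock = [threading.Lock() for _ in range(num_nodes)]
        # Stage 2 state: a global counter touched only by node coordinators.
        self.global_count = 0
        self.global_lock = threading.Lock()
        # Stage 3 state: a generation number bumped once per barrier episode.
        self.generation = 0
        self.wakeup = threading.Condition()

    def wait(self, node):
        # Safe to read before arriving: the generation cannot advance
        # until every thread, including this one, has arrived.
        with self.wakeup:
            my_generation = self.generation

        # Stage 1: arrival within the node. The last arriver becomes coordinator
        # (an assumption of this sketch) and resets the counter for the next episode.
        with self.node_lock[node]:
            self.node_count[node] += 1
            is_coordinator = self.node_count[node] == self.threads_per_node
            if is_coordinator:
                self.node_count[node] = 0

        # Stage 2: arrival across nodes. Only coordinators touch global state,
        # so cross-node traffic is one update per node rather than one per thread.
        if is_coordinator:
            with self.global_lock:
                self.global_count += 1
                is_last = self.global_count == self.num_nodes
                if is_last:
                    self.global_count = 0
            if is_last:
                # Stage 3 (release side): start the wakeup.
                with self.wakeup:
                    self.generation += 1
                    self.wakeup.notify_all()

        # Stage 3 (wait side): block until the episode's generation changes.
        with self.wakeup:
            while self.generation == my_generation:
                self.wakeup.wait()


def run_demo(num_nodes=2, threads_per_node=3, rounds=5):
    """Return, for every (thread, round), how many threads had arrived when it was released."""
    barrier = HierarchicalBarrier(threads_per_node, num_nodes)
    arrived = [0] * rounds
    arrived_lock = threading.Lock()
    observed = []

    def worker(node):
        for r in range(rounds):
            with arrived_lock:
                arrived[r] += 1
            barrier.wait(node)
            with arrived_lock:
                observed.append(arrived[r])

    threads = [threading.Thread(target=worker, args=(n,))
               for n in range(num_nodes) for _ in range(threads_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed
```

Calling `run_demo(2, 3, 5)` exercises the barrier with two logical nodes of three threads each. Every recorded value should equal the total thread count, since no thread may leave a round before all threads have entered it. A real implementation would replace the locks with per-node cache lines and spinning, which is where the placement and communication-pattern choices discussed in the abstract come in.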
Modern NUMA multicore architectures exhibit complicated memory behavior, such as cache coherence invalidation and nonuniform memory access, where access from a core to its local memory is significantly faster than crossnode access to memory on a different NUMA node. This behavior has a large impact on the efficiency of lock synchronization, which in turn affects the performance of parallel applications. Prior work offers several efficient designs to improve locking performance, such as delegation schemes. However, existing delegation schemes occupy computing cores, fail to scale, or are less portable. In this work, we present a NUMA-aware delegation lock that occupies no cores while offering scalable performance under high contention on NUMA multicore machines. The new lock is a variant of the efficient FFWD lock and inherits its performance features, such as buffering responses within a NUMA node to minimize cache coherence traffic. Unlike FFWD, the new lock employs hierarchical NUMA-aware memory allocation and a NUMA-aware dynamic server thread technique to reduce crossnode communication between client and server threads. Our evaluation shows that the new lock outperforms FFWD under high contention and achieves significant performance gains compared with other state-of-the-art locks.
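As background for the abstract above, the basic delegation idea can be sketched as follows: clients hand their critical sections to a server thread, which runs them one at a time, so the shared data is touched by a single thread only. This sketch uses a dedicated server thread and a Python queue, which is the simplest form of delegation and not the lock proposed in the abstract. The dynamic server thread, the NUMA-aware allocation, and the per-node response buffering are not modeled. The names `DelegationServer`, `delegate`, and `run_counter_demo` are hypothetical.

```python
import queue
import threading


class DelegationServer:
    """Minimal delegation sketch: one server thread executes all critical sections."""

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # The server is the only thread that runs delegated functions,
        # so the data they touch needs no further locking.
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            func, args, reply = item
            reply["value"] = func(*args)
            reply["done"].set()

    def delegate(self, func, *args):
        """Submit a critical section and block until the server has run it."""
        reply = {"done": threading.Event(), "value": None}
        self._requests.put((func, args, reply))
        reply["done"].wait()
        return reply["value"]

    def shutdown(self):
        self._requests.put(None)
        self._thread.join()


def run_counter_demo(clients=4, increments=500):
    """Increment a shared counter from several clients via delegation; return the final value."""
    server = DelegationServer()
    state = {"counter": 0}

    def increment():
        state["counter"] += 1
        return state["counter"]

    def client():
        for _ in range(increments):
            server.delegate(increment)

    threads = [threading.Thread(target=client) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    server.shutdown()
    return state["counter"]
```

With four clients each delegating 500 increments, `run_counter_demo(4, 500)` should return 2000, even though `increment` itself takes no lock. The sketch does not handle a delegated function that raises an exception; a fuller version would capture the exception and re-raise it in the client.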