Profiling Directed NUMA Optimization on Linux Systems: A Case Study of the Gaussian Computational Chemistry Code

Yang, Rui; Antony, Joseph; Rendell, Alistair P.; Robson, D.; Strazdins, Peter

doi:10.1109/ipdps.2011.100

Cited by 13 publications

(12 citation statements)

References 40 publications


“…Many works have been done to improve the performance of a particular application [Shaheen and Strzodka 2012;Yang et al 2011;Castro et al 2009] or general applications [Vikranth et al 2013;Pilla et al 2011;Muddukrishna et al 2013] by increasing local memory accesses in the NUMA memory system (i.e., the first approach). nuCATS and nuCORALS [Shaheen and Strzodka 2012] improved the performance of iterative stencil computations for the NUMA memory system by optimizing temporal blocking and tiling.…”

Section: Related Work (mentioning)

confidence: 99%

Locality-Aware Work Stealing Based on Online Profiling and Auto-Tuning for Multisocket Multicore Architectures

Chen

Guo

2015

ACM Trans. Archit. Code Optim.


Modern mainstream powerful computers adopt multisocket multicore CPU architecture and NUMA-based memory architecture. While traditional work-stealing schedulers are designed for single-socket architectures, they incur severe shared cache misses and remote memory accesses in these computers. To solve the problem, we propose a locality-aware work-stealing (LAWS) scheduler, which better utilizes both the shared cache and the memory system. In LAWS, a load-balanced task allocator is used to evenly split and store the dataset of a program to all the memory nodes and allocate a task to the socket where the local memory node stores its data for reducing remote memory accesses. Then, an adaptive DAG packer adopts an auto-tuning approach to optimally pack an execution DAG into cache-friendly subtrees. After cache-friendly subtrees are created, every socket executes cache-friendly subtrees sequentially for optimizing shared cache usage. Meanwhile, a triple-level work-stealing scheduler is applied to schedule the subtrees and the tasks in each subtree. Through theoretical analysis, we show that LAWS has comparable time and space bounds compared with traditional work-stealing schedulers. Experimental results show that LAWS can improve the performance of memory-bound programs up to 54.2% on AMD-based experimental platforms and up to 48.6% on Intel-based experimental platforms compared with traditional work-stealing schedulers.

ACM Reference Format: Quan Chen and Minyi Guo. 2015. Locality-aware work stealing based on online profiling and auto-tuning for multisocket multicore architectures. ACM Trans. " , which was published in the International Conference on Supercomputing (ICS 2014). The 30% new material comes from two aspects:
-We have analyzed the theoretical time and space bounds of LAWS. Based on our analysis, the theoretical time and space bounds are comparable to the original random work-stealing scheduler.
-This article has also significantly enhanced the experimental evaluation. We have evaluated the performance of LAWS on both Intel-based platforms and AMD-based platforms.
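The allocation idea in the abstract above (split the dataset evenly across memory nodes, then run each task on the socket whose local node holds its data) can be illustrated with a minimal Python sketch. This is not the LAWS implementation: the function names `split_evenly`, `home_node`, and `assign_tasks`, and the representation of a task as a `(task_id, first_index)` pair, are hypothetical and chosen only for illustration.

```python
def split_evenly(n_items, n_nodes):
    """Split indices 0..n_items-1 into n_nodes contiguous half-open ranges.

    Range sizes differ by at most one element (an assumed notion of "even").
    """
    base, extra = divmod(n_items, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def home_node(index, ranges):
    """Return the memory node whose range contains the given data index."""
    for node, (start, end) in enumerate(ranges):
        if start <= index < end:
            return node
    raise IndexError(index)


def assign_tasks(tasks, ranges):
    """Place each (task_id, first_index) task on the node that stores its data."""
    placement = {node: [] for node in range(len(ranges))}
    for task_id, first_index in tasks:
        placement[home_node(first_index, ranges)].append(task_id)
    return placement


ranges = split_evenly(10, 4)
placement = assign_tasks([("a", 0), ("b", 4), ("c", 9)], ranges)
```

In a real system the ranges would correspond to pages physically bound to each node, and the placement would be a hint to the scheduler rather than a fixed assignment, since work stealing may still move tasks between sockets.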



“…These tools mainly use two kinds of methods: simulation and measurement. The simulation tools such as MACPO [25] and NUMAgrind [32] collect memory traces and feed into a cache simulator. The simulator simulates an architecture with NUMA memory hierarchies to analyze the memory traces.…”

Section: Related Work (mentioning)

confidence: 99%

“…Tools such as MACPO [25] and NUMAgrind [32] use simulation to identify NUMA bottlenecks in a program. A drawback of tools that simulate all memory accesses is that they are slow, which makes them of limited use for programs with significant running time.…”

Section: Introduction (mentioning)

confidence: 99%

A tool to analyze the performance of multithreaded programs on NUMA architectures

Liu

Mellor-Crummey

2014

Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming


Almost all of today's microprocessors contain memory controllers and directly attach to memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is faster for a microprocessor to access memory that is directly attached than it is to access memory attached to another processor. Without careful distribution of computation and data, a multithreaded program running on such a system may have high average memory access latency. To use multiprocessor systems efficiently, programmers need performance tools to guide the design of NUMA-aware codes. To address this need, we enhanced the HPCToolkit performance tools to support measurement and analysis of performance problems on multiprocessor systems with multiple NUMA domains. With these extensions, HPCToolkit helps pinpoint, quantify, and analyze NUMA bottlenecks in executions of multithreaded programs. It computes derived metrics to assess the severity of bottlenecks, analyzes memory accesses, and provides a wealth of information to guide NUMA optimization, including information about how to distribute data to reduce access latency and minimize contention. This paper describes the design and implementation of our extensions to HPCToolkit. We demonstrate their utility by describing case studies in which we use these capabilities to diagnose NUMA bottlenecks in four multithreaded applications.


“…In this example, the optimization adopts the memory trace scheme similar to [10] [13]. By analyzing the memory trace, physical patterns (contrast to the logical access patterns) can be drawn and represented in memory access matrix or communication matrix [16].…”

Section: The Tuning Steps Based On Oprofile (mentioning)

confidence: 99%
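As a rough illustration of the memory access matrix mentioned in the statement above, the following Python sketch aggregates a memory trace into a node-by-node matrix. The trace format (pairs of thread id and page id), the two lookup tables, and the function names are assumptions made for this example; they are not taken from the cited papers.

```python
def access_matrix(trace, thread_node, page_node, n_nodes):
    """Build an n_nodes x n_nodes matrix from a memory trace.

    Entry [i][j] counts accesses issued by threads running on node i
    to pages that physically reside on node j.
    """
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for thread_id, page_id in trace:
        matrix[thread_node[thread_id]][page_node[page_id]] += 1
    return matrix


def remote_fraction(matrix):
    """Fraction of all accesses that target a node other than the issuer's."""
    total = sum(sum(row) for row in matrix)
    local = sum(matrix[i][i] for i in range(len(matrix)))
    return 0.0 if total == 0 else (total - local) / total


# Hypothetical trace: two threads, one per node, touching two pages.
trace = [(0, 10), (0, 11), (1, 10), (1, 11), (1, 11)]
thread_node = {0: 0, 1: 1}    # thread id -> node it runs on
page_node = {10: 0, 11: 1}    # page id -> node holding the page
matrix = access_matrix(trace, thread_node, page_node, n_nodes=2)
fraction = remote_fraction(matrix)
```

Off-diagonal entries are remote accesses, so a matrix dominated by its diagonal indicates good locality, while heavy off-diagonal entries point at pages or threads worth migrating.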

“…Some more complicated APIs are based on these basic policies, such as MAi [7] and MaMI [9]. It is not an easy task to apply these API because it is much difficult to find the communication pattern in shared memory platform than message passing platform, because it is implicit and occurs through the memory accesses. Recently, some tools are available to guide a program developer on where to judiciously apply these API within a large parallel code [10][11][12]. But it is still a hard problem to find the best mapping of the access patterns, which is considered NP-Hard [13].…”

Section: Introduction (mentioning)

confidence: 99%

MAP-numa: Access Patterns Used to Characterize the NUMA Memory Access Optimization Techniques and Algorithms

Luo

Liu

Kong

et al. 2012

Lecture Notes in Computer Science


Abstract. Some typical memory access patterns are provided and programmed in C, which can be used as benchmark to characterize the various techniques and algorithms aim to improve the performance of NUMA memory access. These access patterns, called MAP-numa (Memory Access Patterns for NUMA), currently include three classes, whose working data sets are corresponding to 1-dimension array, 2-dimension matrix and 3-dimension cube. It is dedicated for NUMA memory access optimization other than measuring the memory bandwidth and latency. MAP-numa is an alternative to those exist benchmarks such as STREAM, pChase, etc. It is used to verify the optimizations' (made automatically/manually to source code/executive binary) capacities by investigating what locality leakage can be remedied. Some experiment results are shown, which give an example of using MAP-numa to evaluate some optimizations based on Oprofile sampling.

