2011 IEEE International Parallel & Distributed Processing Symposium, 2011
DOI: 10.1109/ipdps.2011.100

Profiling Directed NUMA Optimization on Linux Systems: A Case Study of the Gaussian Computational Chemistry Code

Abstract: The parallel performance of applications running on Non-Uniform Memory Access (NUMA) platforms is strongly influenced by the relative placement of memory pages to the threads that access them. As a consequence there are Linux application programmer interfaces (APIs) to control this. For large parallel codes it can, however, be difficult to determine how and when to use these APIs. In this paper we introduce the NUMAgrind profiling tool which can be used to simplify this process. It extends the Valgrind binary …

Citations: cited by 13 publications (12 citation statements)
References: 40 publications
“…Many works have been done to improve the performance of a particular application [Shaheen and Strzodka 2012; Yang et al 2011; Castro et al 2009] or general applications [Vikranth et al 2013; Pilla et al 2011; Muddukrishna et al 2013] by increasing local memory accesses in the NUMA memory system (i.e., the first approach). nuCATS and nuCORALS [Shaheen and Strzodka 2012] improved the performance of iterative stencil computations for the NUMA memory system by optimizing temporal blocking and tiling.…”
Section: Related Work (mentioning)
confidence: 99%
“…These tools mainly use two kinds of methods: simulation and measurement. The simulation tools such as MACPO [25] and NUMAgrind [32] collect memory traces and feed into a cache simulator. The simulator simulates an architecture with NUMA memory hierarchies to analyze the memory traces.…”
Section: Related Work (mentioning)
confidence: 99%
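The trace-driven approach described in this statement can be illustrated with a toy sketch. It is not the implementation of NUMAgrind or MACPO: the trace format (`(thread_id, address)` pairs), the 4 KiB page size, the first-touch placement rule, and the names `simulate_first_touch` and `thread_to_node` are all assumptions made for illustration, and no cache hierarchy is modeled.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes


def simulate_first_touch(trace, thread_to_node):
    """Replay a memory trace and count local vs. remote page accesses.

    trace: iterable of (thread_id, address) records (hypothetical format).
    thread_to_node: maps each thread id to the NUMA node it runs on.
    A page is homed on the node of the first thread that touches it.
    """
    page_home = {}
    counts = Counter()
    for thread_id, address in trace:
        page = address // PAGE_SIZE
        node = thread_to_node[thread_id]
        # setdefault records the home node only on the first touch.
        home = page_home.setdefault(page, node)
        counts["local" if home == node else "remote"] += 1
    return counts


# Two threads pinned to different nodes, sharing two pages.
trace = [(0, 0), (0, 64), (1, 128), (1, 8192), (0, 8200)]
result = simulate_first_touch(trace, {0: 0, 1: 1})
print(dict(result))
```

A high remote count for a page range suggests that its placement, or the placement of the threads using it, is worth revisiting.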
“…Tools such as MACPO [25] and NUMAgrind [32] use simulation to identify NUMA bottlenecks in a program. A drawback of tools that simulate all memory accesses is that they are slow, which makes them of limited use for programs with significant running time.…”
Section: Introduction (mentioning)
confidence: 99%
“…In this example, the optimization adopts the memory trace scheme similar to [10] [13]. By analyzing the memory trace, physical patterns (contrast to the logical access patterns) can be drawn and represented in memory access matrix or communication matrix [16].…”
Section: The Tuning Steps Based On Oprofile (mentioning)
confidence: 99%
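As a rough illustration of how a communication matrix might be derived from a memory trace, the sketch below counts the pages each pair of threads has in common. The trace format, the page size, the page-sharing metric, and the name `communication_matrix` are assumptions for illustration only; the cited works may define the matrix differently.

```python
from itertools import combinations

PAGE_SIZE = 4096  # assumed page size in bytes


def communication_matrix(trace, num_threads):
    """Build a symmetric matrix of pages shared between thread pairs.

    trace: iterable of (thread_id, address) records (hypothetical format).
    Entry [a][b] is the number of distinct pages touched by both a and b.
    """
    pages_by_thread = [set() for _ in range(num_threads)]
    for thread_id, address in trace:
        pages_by_thread[thread_id].add(address // PAGE_SIZE)

    matrix = [[0] * num_threads for _ in range(num_threads)]
    for a, b in combinations(range(num_threads), 2):
        shared = len(pages_by_thread[a] & pages_by_thread[b])
        matrix[a][b] = matrix[b][a] = shared
    return matrix


trace = [(0, 0), (1, 100), (1, 4096), (2, 5000), (2, 20000)]
matrix = communication_matrix(trace, 3)
for row in matrix:
    print(row)
```

Thread pairs with large entries are candidates for placement on the same NUMA node.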
“…Some more complicated APIs are based on these basic policies, such as MAi [7] and MaMI [9]. It is not an easy task to apply these API because it is much difficult to find the communication pattern in shared memory platform than message passing platform, because it is implicit and occurs through the memory accesses. Recently, some tools are available to guide a program developer on where to judiciously apply these API within a large parallel code [10][11][12]. But it is still a hard problem to find the best mapping of the access patterns, which is considered NP-Hard [13].…”
Section: Introduction (mentioning)
confidence: 99%