2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC) 2019
DOI: 10.1109/mcsoc.2019.00010

An Automatic MPI Process Mapping Method Considering Locality and Memory Congestion on NUMA Systems

Cited by 4 publications (6 citation statements)
References 19 publications
“…We cannot compare DeLoc with CDSM, Carrefour and AsymSched because CDSM depends on a previous version of the Linux kernel and because Carrefour and Asymsched require a profiling mechanism that is available only in AMD processors. However, as shown in our evaluation and our previous work [33], the migration overhead can have a significant impact on performance, and DeLoc does not suffer from this overhead. Moreover, unlike these methods, DeLoc works on the application level and does not rely on a specific operating system or hardware.…”
Section: Related Work
Citation type: supporting
confidence: 61%
“…First, as shown in related work [8], [14], the cost of remote-access communication remains the limiting factor in modern NUMA systems. Second, in many parallel applications, improving the communication locality also significantly affects the balance of memory accesses [16], [33], indicating that tasks that have higher amounts of communication also perform numerous memory accesses. We discuss the impacts of our method on both the communication locality and the memory congestion in Sections IV-A-3 and IV-B.…”
Section: Communication Behaviors That Affect the Locality And Th…
Citation type: mentioning
confidence: 99%
“…In the cases of BT and SP, the performance differences among the methods are smaller than that in the other applications. It is because, as shown in our previous work [3], these two applications have the communication behavior that can benefit from the Default mapping. In these two applications, most communication events are performed by the neighboring processes, and thus the Default mapping is sufficient to improve the performance of these applications.…”
Section: Performance Evaluation
Citation type: mentioning
confidence: 67%
“…We obtain cache misses by aggregating the cache misses of all cache levels across the NUMA nodes. For all the applications, the migration overhead is less than 11%, and the highest overheads on the interconnect traffic are imposed in CG and FT. As shown in our previous work [3], CG has a wide variation of the amount of communication among processes. Moreover, these two applications have a high number of memory accesses.…”
Section: Overhead Of Ondeloc-mpi
Citation type: mentioning
confidence: 69%