An Automatic MPI Process Mapping Method Considering Locality and Memory Congestion on NUMA Systems

Agung, Mulya; Amrizal, Muhammad Alfian; Egawa, Ryusuke; Takizawa, Hiroyuki

doi:10.1109/mcsoc.2019.00010

Cited by 4 publications

(6 citation statements)

References 19 publications

“…We cannot compare DeLoc with CDSM, Carrefour and AsymSched because CDSM depends on a previous version of the Linux kernel and because Carrefour and Asymsched require a profiling mechanism that is available only in AMD processors. However, as shown in our evaluation and our previous work [33], the migration overhead can have a significant impact on performance, and DeLoc does not suffer from this overhead. Moreover, unlike these methods, DeLoc works on the application level and does not rely on a specific operating system or hardware.…”

Section: Related Work (supporting)

confidence: 61%

“…First, as shown in related work [8], [14], the cost of remote-access communication remains the limiting factor in modern NUMA systems. Second, in many parallel applications, improving the communication locality also significantly affects the balance of memory accesses [16], [33], indicating that tasks that have higher amounts of communication also perform numerous memory accesses. We discuss the impacts of our method on both the communication locality and the memory congestion in Sections IV-A-3 and IV-B.…”

Section: Communication Behaviors That Affect the Locality And Th (mentioning)

confidence: 99%

DeLoc: A Locality and Memory-Congestion-Aware Task Mapping Method for Modern NUMA Systems

et al. 2020

Self Cite

The mapping of tasks to processor cores, called task mapping, is crucial to achieving scalable performance on multicore processors. On modern NUMA (non-uniform memory access) systems, the memory congestion problem can degrade performance more severely than the data locality problem because heavy congestion on shared caches and memory controllers can cause long latencies. Conventional work on task mapping mostly focuses on improving the locality of memory accesses. However, our previous work showed that on modern NUMA systems, maximizing the locality can degrade performance due to memory congestion. In this work, we propose a task mapping method that addresses both the locality and the memory congestion problems to improve the performance of parallel applications. In the proposed method, first, the spatial and temporal communication behaviors of the applications are analyzed from the time-series dataset of communications among the parallel tasks. Then, a data clustering technique is employed to detect groups of tasks that potentially cause memory congestion. Finally, this information is used to compute a task mapping that improves the locality and reduces the memory congestion. We also provide a set of metrics to describe the communication behaviors and to evaluate whether the target application can benefit from our method. The proposed method is evaluated with the NPB and PARSEC applications on a real NUMA system and a multicore simulator. A detailed analysis of the sources of performance gain is also provided. Experimental results show that our method can achieve up to a 61% performance improvement compared with the state-of-the-art locality-based method.
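
The trade-off this abstract describes, locality versus memory congestion, can be illustrated with a small sketch. It is not the DeLoc algorithm: the event format `(time, src, dst, bytes)`, the function names, and both metrics are assumptions made for illustration. The abstract does not define the authors' metrics or their clustering step. The sketch aggregates a communication trace into a matrix, then scores two mappings by the fraction of traffic kept inside a NUMA node and by how unevenly traffic load is spread across nodes.

```python
def comm_matrix(events, n_tasks):
    """Aggregate (time, src, dst, bytes) events into a symmetric volume matrix."""
    m = [[0] * n_tasks for _ in range(n_tasks)]
    for _time, src, dst, nbytes in events:
        m[src][dst] += nbytes
        m[dst][src] += nbytes
    return m


def locality(m, node_of):
    """Fraction of total volume exchanged between tasks on the same node."""
    total = local = 0
    for i in range(len(m)):
        for j in range(i + 1, len(m)):
            total += m[i][j]
            if node_of[i] == node_of[j]:
                local += m[i][j]
    return local / total if total else 1.0


def node_imbalance(m, node_of, n_nodes):
    """Max per-node load divided by mean per-node load (1.0 means balanced).

    A task's load is approximated by its total communication volume,
    a stand-in (assumption) for its pressure on the node's memory controller.
    """
    load = [0] * n_nodes
    for task, row in enumerate(m):
        load[node_of[task]] += sum(row)
    mean = sum(load) / n_nodes
    return max(load) / mean if mean else 1.0


# Hypothetical trace: tasks 0 and 1 communicate heavily, the rest lightly.
events = [(0, 0, 1, 100), (1, 2, 3, 10), (2, 0, 1, 100), (3, 1, 2, 10)]
m = comm_matrix(events, n_tasks=4)

packed = [0, 0, 1, 1]  # heavy pair shares node 0: high locality
spread = [0, 1, 0, 1]  # heavy pair split across nodes: balanced load

for name, mapping in (("packed", packed), ("spread", spread)):
    print(name, round(locality(m, mapping), 3),
          round(node_imbalance(m, mapping, n_nodes=2), 3))
```

On this toy trace, the packed mapping keeps most traffic local but concentrates load on one node, while the spread mapping balances load at the cost of locality. A method that addresses both problems has to search between these extremes.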

“…In the cases of BT and SP, the performance differences among the methods are smaller than that in the other applications. It is because, as shown in our previous work [3], these two applications have the communication behavior that can benefit from the Default mapping. In these two applications, most communication events are performed by the neighboring processes, and thus the Default mapping is sufficient to improve the performance of these applications.…”

Section: Performance Evaluation (mentioning)

confidence: 67%

“…We obtain cache misses by aggregating the cache misses of all cache levels across the NUMA nodes. For all the applications, the migration overhead is less than 11%, and the highest overheads on the interconnect traffic are imposed in CG and FT. As shown in our previous work [3], CG has a wide variation of the amount of communication among processes. Moreover, these two applications have a high number of memory accesses.…”

Section: Overhead Of Ondeloc-mpi (mentioning)

confidence: 69%

Online MPI Process Mapping for Coordinating Locality and Memory Congestion on NUMA Systems

Agung, Amrizal, Egawa, et al. 2020

JSFI

Mapping MPI processes to processor cores, called process mapping, is crucial to achieving scalable performance on multi-core processors. By analyzing the communication behavior among MPI processes, process mapping can improve the communication locality, and thus reduce the overall communication cost. However, on modern non-uniform memory access (NUMA) systems, the memory congestion problem can degrade performance more severely than the locality problem because heavy congestion on shared caches and memory controllers can cause long latencies. Most existing work focuses only on improving the locality or relies on offline profiling to analyze the communication behavior. We propose a process mapping method that dynamically performs the process mapping to adapt to communication behaviors while coordinating the locality and memory congestion. Our method works online during the execution of an MPI application. It does not require modifications to the application, prior knowledge of the communication behavior, or changes to the hardware and operating system. Experimental results show that our method can achieve performance and energy efficiency close to the best static mapping method with low overhead on application execution. In experiments with the NAS parallel benchmarks on a NUMA system, the performance and total energy improvements are up to 34% (18.5% on average) and 28.9% (13.6% on average), respectively. In experiments with two GROMACS applications on a larger NUMA system, the average improvements in performance and total energy consumption are 21.6% and 12.6%, respectively.
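
An online method has to decide, during execution, when the communication behavior has changed enough to justify remapping. The abstract does not say how the authors make that decision, so the sketch below is only one plausible approach under stated assumptions. It compares the communication matrices of two consecutive monitoring intervals by a normalized L1 distance and flags a remap when the distance exceeds a threshold. The names `pattern_change` and `should_remap` and the threshold value 0.3 are hypothetical.

```python
def pattern_change(prev, curr):
    """Normalized L1 distance between two communication matrices, in [0, 1].

    Each matrix is first scaled to sum to 1, so only the shape of the
    pattern matters and not the absolute traffic volume.
    """
    def normalize(m):
        total = sum(sum(row) for row in m)
        return [[v / total if total else 0.0 for v in row] for row in m]

    p, c = normalize(prev), normalize(curr)
    diff = sum(abs(a - b) for rp, rc in zip(p, c) for a, b in zip(rp, rc))
    return diff / 2.0


def should_remap(prev, curr, threshold=0.3):
    """Flag a remap when the pattern shifts by more than `threshold`.

    The default threshold is an arbitrary illustrative value (assumption).
    """
    return pattern_change(prev, curr) > threshold


# Hypothetical per-interval matrices for three MPI processes.
interval_a = [[0, 10, 0], [10, 0, 0], [0, 0, 0]]   # ranks 0 and 1 talk
interval_b = [[0, 20, 0], [20, 0, 0], [0, 0, 0]]   # same pattern, more volume
interval_c = [[0, 0, 0], [0, 0, 10], [0, 10, 0]]   # ranks 1 and 2 talk

print(should_remap(interval_a, interval_b))  # same shape, no remap needed
print(should_remap(interval_a, interval_c))  # pattern moved, remap candidate
```

Normalizing before comparing keeps a pure increase in traffic volume from triggering a migration, which matters because each migration carries its own overhead.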

An effective scheme for memory congestion reduction in multi-core environment

Upadhyay, Singh 2022

Journal of King Saud University - Computer and Information Scie
