Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems 2013
DOI: 10.1145/2451116.2451157
Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems

Abstract: NUMA systems are characterized by Non-Uniform Memory Access times, where accessing data in a remote node takes longer than a local access. NUMA hardware has been built since the late 1980s, and the operating systems designed for it were optimized for access locality: they co-located memory pages with the threads that accessed them, so as to avoid the cost of remote accesses. Contrary to older systems, modern NUMA hardware has much smaller remote wire delays, and so remote access costs per se are not the main co…

Cited by 199 publications
(16 citation statements)
References 22 publications
“…A memory placement method, called Carrefour, has been proposed in [10]. It improves performance on modern NUMA systems by reconciling the data locality and the memory congestion problems.…”
Section: Related Work
confidence: 99%
“…Hybrid memory placement policies attempt to fully utilize total system bandwidth by distributing pages between system memory and stacked memory based on the bandwidth ratio (Agarwal et al 2015; Chou et al 2015a). NUMA-aware placement, on the other hand, focuses on placing data near computing resources to minimize overall latency (Dashti et al 2013; Verghese et al 1996; Bolosky et al 1989). Our work is orthogonal to these proposals.…”
Section: Related Work
confidence: 99%
“…Recently, migrating memory pages to improve memory locality during the execution of a parallel application has received renewed attention. Several such mechanisms have been proposed, operating at the hardware level [7,8,41], the compiler level [33,38], or the OS level [9,11,17]. These mechanisms do not require changes to the application to improve locality, but can cause a significant runtime overhead that limits their gains compared to the manual changes applied in this paper.…”
Section: Related Work
confidence: 99%
“…As memory is shared between all threads on the same node in an OpenMP environment, care must be taken to place data close to the threads that use it. This can result in significantly faster data accesses in shared memory architectures [3,7,9,11,16,33]. On the other hand, data used by each MPI rank is generally private to the rank [14], such that locality issues have a much lower impact on a single cluster node in general.…”
Section: Introduction
confidence: 99%