Unleashing the performance of ccNUMA multiprocessor architectures in heterogeneous stencil computations

IEEE Trans. Parallel Distrib. Syst.

Kuczynski et al. 2021

Self Cite

The advantages of the second-generation AMD EPYC Rome processors can be successfully used in the race to Exascale. However, the novel architecture's complexity makes it challenging to adapt demanding scientific codes (like stencil ones) to platforms with Rome CPUs. This paper tackles this challenge by exploring the adaptation of the stencil-based CFD (computational fluid dynamics) application called MPDATA to these processors' influential features. We show that the previously proposed parametric adaptation methodology can be profitably applied to extend the performance portability of the memory-bound MPDATA on the AMD EPYC architecture. The extension of the parametric adaptation on the novel architecture requires careful consideration of two relevant aspects that reflect splitting the Rome architecture into multiple dies: features of the cache hierarchy and partitioning cores into work teams. The paper also investigates the correlation between the performance optimizations and energy efficiency for a ccNUMA platform powered by top-of-the-line 64-core AMD Rome 7742 CPUs, comparing the results against two servers with Intel Xeon Scalable processors of different generations. Even without appealing to prices, the achieved performance and energy efficiency results are a solid argument confirming the competitiveness of AMD Rome processors against Intel Xeon CPUs in scientific applications.

Section: Parallelization Methodology For MPDATA Code On Shared Memory Systems

mentioning

confidence: 99%

“…Through this die, a given CCD can communicate with other CCDs and the main memory, as well as with external devices connected by the PCIe bus. As a result, the EPYC 7742 CPU can provide one NUMA domain for a single processor, which is equivalent to the NUMA layout offered by current Intel Xeon CPUs [35]. This mode is known as NPS1 [5].…”

Section: Related Work

mentioning

confidence: 99%

Architectural Adaptation and Performance-Energy Optimization for CFD Application on AMD EPYC Rome

IEEE Trans. Parallel Distrib. Syst.

Kuczynski et al. 2021

Self Cite

“…The next optimization step (version C) fits perfectly into multi-socket architectures [36], [42]. As shown in Fig.…”

Section: Energy/power and Performance Comparison For

mentioning

confidence: 99%

“…To alleviate the memory-bound nature of MPDATA, we developed [9], [36], [37], [38] a parallelization methodology for MPDATA heterogeneous stencil computations. It contributes to ease the memory and communication bounds, and exploits resources of multicore ccNUMA/SMP systems better.…”

Section: MPDATA Parallelization

mentioning

confidence: 99%

“…Partitioning cores into independent work teams ([36]): this step enables the flexible management of trade-off between costs of computation and communication, following features of ccNUMA systems. As a result, two scenarios of executing MPDATA kernels are introduced: the first one performs fewer computations but requires more data traffic, while the second scenario allows us to replace the implicit data traffic by replicating some of the computations.…”
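The trade-off described in this statement can be illustrated with a toy example. The sketch below is not the MPDATA code or the authors' implementation; it assumes a hypothetical two-stage, 3-point averaging stencil on a 1D array split between two work teams, and all function names (`three_point`, `needed`, `run_shared`, `run_replicated`) are invented for illustration. It counts how many stage-1 points are computed and how many intermediate values a team must read from the other team in each scenario.

```python
def three_point(src, lo, hi):
    """Average src[i-1], src[i], src[i+1] for every i in [lo, hi)."""
    return {i: (src[i - 1] + src[i] + src[i + 1]) / 3.0 for i in range(lo, hi)}


def needed(lo, hi):
    """Stage-1 indices that stage 2 reads when it computes [lo, hi)."""
    return set(range(lo - 1, hi + 1))


def run_shared(x, mid):
    """Scenario 1: no redundant work, teams read each other's stage-1 results."""
    n = len(x)
    a0 = three_point(x, 1, mid)          # team 0 stage 1
    a1 = three_point(x, mid, n - 1)      # team 1 stage 1
    shared = {**a0, **a1}                # models data traffic between teams
    b = {**three_point(shared, 2, mid), **three_point(shared, mid, n - 2)}
    cross = len(needed(2, mid) - set(a0)) + len(needed(mid, n - 2) - set(a1))
    return b, len(a0) + len(a1), cross


def run_replicated(x, mid):
    """Scenario 2: each team recomputes the boundary point it would otherwise fetch."""
    n = len(x)
    a0 = three_point(x, 1, mid + 1)      # team 0 stage 1, one extra point
    a1 = three_point(x, mid - 1, n - 1)  # team 1 stage 1, one extra point
    b = {**three_point(a0, 2, mid), **three_point(a1, mid, n - 2)}
    cross = len(needed(2, mid) - set(a0)) + len(needed(mid, n - 2) - set(a1))
    return b, len(a0) + len(a1), cross


if __name__ == "__main__":
    x = [float(i * i % 7) for i in range(12)]
    for name, run in (("shared", run_shared), ("replicated", run_replicated)):
        b, stage1_points, cross_reads = run(x, 6)
        print(name, "stage-1 points:", stage1_points, "cross-team reads:", cross_reads)
```

For a 12-element input split at index 6, the shared scenario computes 10 stage-1 points with 2 cross-team reads, while the replicated scenario computes 12 stage-1 points with none; both return identical final results.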

Section: MPDATA Parallelization

mentioning

confidence: 99%

Correlation of Performance Optimizations and Energy Consumption for Stencil-Based Application on Intel Xeon Scalable Processors

IEEE Trans. Parallel Distrib. Syst.

Olas et al. 2020

Self Cite

This article provides a comprehensive study of the impact of performance optimizations on the energy efficiency of a real-world CFD application called MPDATA, as well as an insightful analysis of performance-energy interaction of these optimizations with the underlying hardware that represents the first generation of Intel Xeon Scalable processors. Considering the MPDATA iterative application as a use case, we explore the fundamentals of energy and performance analysis for a memory-bound application when exposed to a set of optimization steps that increase the application performance, by improving the operational intensity of code and utilizing resources more efficiently. It is shown that for memory-bound applications, optimizing toward high performance could be a powerful strategy for improving the energy efficiency as well. In fact, for the considered performance optimizations, the energy gain is correlated with the performance gain but with varying degrees. As a result, these optimizations allow improving both performance and energy consumption radically, up to about 10.9 and 8.8 times, respectively. The impact of the Intel AVX-512 SIMD extension on the energy consumption and performance is demonstrated. Also, we discover limitations on the usability of CPU frequency scaling as a tool for balancing energy savings with admissible performance losses.

Toward Heterogeneous MPI+MPI Programming: Comparison of OpenMP and MPI Shared Memory Models

Euro-Par 2019: Parallel Processing Workshops

Halbiniak et al. 2020