HPC systems and parallel applications are growing in complexity. The ability to easily study and project the performance of scientific applications at large scale is therefore of paramount importance. In this paper we describe a performance analysis method and apply it to four complex HPC applications. We perform our study on a pre-production HPC system powered by the latest Arm-based CPU for HPC, the Marvell ThunderX2. For each application we identify inefficiencies and factors that limit its scalability. The results show that in several cases the bottlenecks stem not from the hardware but from the way the applications are programmed or the way the system software is configured.
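As a toy illustration of how a software-side bottleneck can cap scalability regardless of the hardware, Amdahl's law (a textbook bound, not the paper's specific method) limits the speedup achievable when part of the runtime stays serial:

```python
def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work
    cannot be parallelized (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# Even a 5% serial fraction caps speedup near 20x, no matter how many
# cores the machine provides:
print(round(amdahl_speedup(0.05, 1024), 1))  # 19.6
```

This is why fixing how an application is programmed can matter more than adding nodes once the serial fraction dominates.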
For complex engineering and scientific applications, Computational Fluid Dynamics (CFD) simulations require a huge amount of computational power. It is therefore of paramount importance to carefully assess the performance of CFD codes and to study them in depth to enable optimisation and portability. In this paper, we study three complex CFD codes, OpenFOAM, Alya and CHORUS, representing two numerical methods, namely the finite-volume and finite-element methods, on both structured and unstructured meshes. To all three codes we apply a generic performance analysis method based on a set of metrics that helps code developers spot the critical points that can limit the scalability of a parallel application. We identify the root causes of the performance bottlenecks by studying the three applications on the MareNostrum4 supercomputer, and we conclude by providing hints for improving the performance and scalability of each application.
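A minimal sketch of what such scalability metrics can look like, assuming the common decomposition of each rank's time into "useful computation" versus everything else (communication, waiting). The function names and the exact decomposition are illustrative, not the paper's definitions:

```python
def load_balance(useful: list[float]) -> float:
    """Average over maximum useful-compute time across ranks:
    1.0 means perfectly balanced work."""
    return sum(useful) / len(useful) / max(useful)

def communication_efficiency(useful: list[float], total: list[float]) -> float:
    """Fraction of the slowest rank's wall-clock spent computing
    rather than communicating or waiting."""
    return max(useful) / max(total)

def parallel_efficiency(useful: list[float], total: list[float]) -> float:
    """Multiplicative decomposition: overall efficiency = LB * CommEff."""
    return load_balance(useful) * communication_efficiency(useful, total)

useful = [9.0, 10.0, 8.0, 9.0]     # seconds of computation per MPI rank
total  = [12.0, 12.0, 12.0, 12.0]  # wall-clock seconds per rank
print(parallel_efficiency(useful, total))
```

The multiplicative form is what makes such metrics useful for diagnosis: a low overall efficiency can be attributed to load imbalance, communication, or both.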
Computing technologies populating high-performance computing (HPC) clusters are becoming more and more diverse, offering a wide range of architectural features. As a consequence, efficient programming of such platforms becomes a complex task. In this paper we provide a micro-benchmark study of three HPC clusters based on different CPU architectures predominant in the Top500 ranking: x86, Armv8 and IBM Power9. On these platforms we study a production fluid-dynamics application, leveraging different compiler technologies and micro-architectural features. We finally provide a scalability study on state-of-the-art HPC clusters. The two most relevant conclusions of our study are: i) compiler development is critical for squeezing performance out of the most recent technologies; ii) micro-architectural features such as Single Instruction Multiple Data (SIMD) units and Simultaneous Multi-Threading (SMT) can impact the overall performance. However, a closer look shows that while SIMD improves the performance of compute-bound regions, SMT does not show a clear benefit on HPC workloads.
CCS Concepts: • Applied computing → Physics; • Computing methodologies → Parallel computing methodologies; • General and reference → Performance; • Computer systems organization → Multicore architectures.
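The compute-bound versus memory-bound distinction behind the SIMD observation can be sketched with a roofline-style estimate. The peak numbers below are invented for illustration, not measurements from the paper's clusters:

```python
def roofline_gflops(flops_per_byte: float,
                    peak_gflops: float,
                    peak_gb_per_s: float) -> float:
    """Attainable performance is the lesser of peak compute and what
    memory bandwidth can feed: min(peak, bandwidth * intensity)."""
    return min(peak_gflops, peak_gb_per_s * flops_per_byte)

# SAXPY does 2 flops per 12 bytes moved (two 4-byte loads, one store).
# Wider SIMD units raise peak_gflops, but a memory-bound kernel like
# this one barely notices; a compute-bound kernel benefits directly.
print(roofline_gflops(2 / 12, peak_gflops=1000.0, peak_gb_per_s=100.0))
```

This is one simple model of why SIMD helps only the compute-bound regions of an application.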
Educational institutions in most cases provide a basic theoretical background covering several computational-science topics. However, the High-Performance Computing (HPC) and Parallel and Distributed Computing (PDC) job markets require specialized technical profiles. Even the most skilled students are often not prepared to face production HPC applications of thousands of lines of code, complex computational frameworks from other disciplines, or heterogeneous multi-node machines accessed by hundreds of users. In this paper, we offer an educational package to fill this gap. Leveraging four years of experience with the Student Cluster Competition, we present our educational journey together with the lessons learned and the outcomes of our methodology. We show how, within a semester and on an affordable budget, a university can implement an educational package that prepares students to start competitive professional careers. Our findings also highlight that 78% of the students exposed to our methods remain within HPC higher education, research or industry.
In this paper, we analyze the performance and energy consumption of an Arm-based high-performance computing (HPC) system developed within the European project Mont-Blanc. This system, called Dibona, has been integrated by ATOS/Bull, and it is powered by Marvell's latest CPU, the ThunderX2. This is the same CPU that powers the Astra supercomputer, the first Arm-based supercomputer to enter the Top500 list, in November 2018. We study workloads ranging from microbenchmarks up to large production codes. We include an interdisciplinary evaluation of three scientific applications (a finite-element fluid dynamics code, a smoothed particle hydrodynamics code, and a lattice Boltzmann code) and the Graph500 benchmark, focusing on parallel and energy efficiency as well as studying their scalability up to thousands of Armv8 cores. For comparison, we run the same tests on state-of-the-art x86 nodes included in Dibona and on the Tier-0 supercomputer MareNostrum4. Our experiments show that the ThunderX2 has lower performance on average, mainly due to its smaller vector unit, somewhat compensated by its wider links between the CPU and the main memory. We found that the software ecosystem of the Armv8 architecture is comparable to the one available for Intel. Our results also show that the ThunderX2 delivers similar or better energy-to-solution and scalability, proving that Arm-based chips are legitimate contenders in the market of next-generation HPC systems.
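A hedged sketch of the energy-to-solution comparison: it reduces to average power draw multiplied by time-to-solution, so a slower node can still win on energy if it draws proportionally less power. The numbers below are invented for illustration, not Dibona or MareNostrum4 measurements:

```python
def energy_to_solution_kj(avg_power_w: float, runtime_s: float) -> float:
    """Energy-to-solution = average power * time-to-solution,
    converted from joules to kilojoules."""
    return avg_power_w * runtime_s / 1000.0

# Hypothetical node A: faster but power-hungry; node B: slower, frugal.
e_a = energy_to_solution_kj(avg_power_w=350.0, runtime_s=100.0)
e_b = energy_to_solution_kj(avg_power_w=250.0, runtime_s=130.0)
print(e_a, e_b)  # 35.0 32.5 -> B uses less energy despite being slower
```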