The multidimensional positive definite advection transport algorithm (MPDATA) belongs to the group of nonoscillatory forward-in-time algorithms and performs a sequence of stencil computations. MPDATA is one of the major parts of the dynamic core of the EULAG geophysical model. In this work, we outline an approach to adapting the 3D MPDATA algorithm to the Intel MIC architecture. To utilize the available computing resources, we propose a (3 + 1)D decomposition of the MPDATA heterogeneous stencil computations. This approach is based on a combination of the loop tiling and loop fusion techniques. It allows us to ease memory/communication bounds and better exploit the theoretical floating-point efficiency of the target computing platforms. An important method of improving the efficiency of the (3 + 1)D decomposition is partitioning the available cores/threads into work teams, which reduces inter-cache communication overheads. This method also increases opportunities for the efficient distribution of MPDATA computation onto the available resources of the Intel MIC architecture, as well as Intel CPUs. We discuss preliminary performance results obtained on two hybrid platforms, containing two CPUs and an Intel Xeon Phi coprocessor. The top-of-the-line Intel Xeon Phi 7120P gives the best performance results and executes MPDATA almost 2 times faster than two Intel Xeon E5-2697v2 CPUs.
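The combination of loop tiling and fusion mentioned in this abstract can be illustrated with a deliberately small sketch. It is not the authors' (3 + 1)D decomposition: it assumes a 1D grid, two dependent three-point stencil stages, and zero values outside the domain, and every function name is hypothetical. The point it shows is that each tile computes its own slice of the intermediate array, including a one-cell halo that is recomputed redundantly, so the intermediate result never has to be stored for the whole grid.

```python
def stage(src, lo, hi):
    """Three-point stencil for output indices lo..hi-1; cells outside src count as 0."""
    n = len(src)

    def at(i):
        return src[i] if 0 <= i < n else 0.0

    return [at(i - 1) + at(i) + at(i + 1) for i in range(lo, hi)]


def two_stage_reference(a):
    """Unfused version: the whole intermediate array b is materialized."""
    b = stage(a, 0, len(a))
    return stage(b, 0, len(b))


def two_stage_tiled_fused(a, tile):
    """Tiled and fused version: both stages run back to back on each tile."""
    n = len(a)
    out = [0.0] * n
    for lo in range(0, n, tile):
        hi = min(lo + tile, n)
        # Stage 1 on the tile plus a one-cell halo (clipped at the domain edge).
        b_lo, b_hi = max(lo - 1, 0), min(hi + 1, n)
        b_tile = stage(a, b_lo, b_hi)

        def b_at(i):
            # Intermediate values outside the domain count as 0, as in the reference.
            return b_tile[i - b_lo] if b_lo <= i < b_hi else 0.0

        # Stage 2 reads only the small tile-local buffer.
        for i in range(lo, hi):
            out[i] = b_at(i - 1) + b_at(i) + b_at(i + 1)
    return out


if __name__ == "__main__":
    data = [float((i * i) % 7) for i in range(20)]
    print(two_stage_tiled_fused(data, tile=6) == two_stage_reference(data))
```

The tile size trades redundant halo computation against the size of the tile-local buffer; in a real implementation it would be chosen so that the buffer fits in cache.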
Modern heterogeneous computing platforms have become powerful HPC solutions that can be applied to a wide range of real-life applications. In particular, hybrid platforms equipped with Intel Xeon Phi coprocessors offer the advantages of massively parallel computing, while supporting practically the same parallel programming model as conventional homogeneous solutions. However, it remains an open issue how scientific applications can efficiently utilize hybrid platforms with Intel MIC coprocessors. In this article, we propose an approach for porting a real-life scientific application to such hybrid platforms, assuming no significant modifications of the application code. It allows us to take advantage of all the computing components, including two CPUs and two coprocessors, for the parallel execution of computational workloads. In this study, we focus on the parallel implementation of a numerical model of the dendritic solidification process in isothermal conditions. We develop a sequence of steps that are necessary for porting the solidification application to hybrid platforms with Intel coprocessors and optimizing it for them. The main challenges include not only overlapping data movements with computations, but also ensuring adequate utilization of the cores/threads and vector units of both processors and coprocessors. To this end, we propose an efficient and flexible method for distributing the workload between heterogeneous computing components. To realize the potential benefits of the proposed approach, we choose a heterogeneous programming model based on a combination of the offload mode for Intel MIC and the OpenMP programming standard. The developed approach allows us to execute the whole application up to 9.33× faster than the original parallel version that uses two CPUs. Furthermore, the CPU-MIC hybrid platforms achieve a speedup of about 1.9× over the CPU platform with 24 cores based on the Ivy Bridge architecture, and about 1.5× over the Haswell-based CPU platform with 36 cores.
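The abstract above does not describe how the workload is distributed between CPUs and coprocessors. One common option, shown here only as an assumed illustration, is a static split of grid rows in proportion to each device's measured throughput. The function name `split_rows` and the throughput figures are hypothetical and are not taken from the paper.

```python
def split_rows(n_rows, throughputs):
    """Split n_rows into contiguous (start, stop) ranges proportional to throughputs.

    Rounding uses the largest-remainder rule so that the ranges cover
    exactly n_rows rows with no gaps or overlaps.
    """
    total = sum(throughputs)
    exact = [n_rows * t / total for t in throughputs]
    counts = [int(x) for x in exact]
    leftover = n_rows - sum(counts)
    # Hand the remaining rows to the devices with the largest fractional parts.
    by_fraction = sorted(range(len(exact)),
                         key=lambda k: exact[k] - counts[k],
                         reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    ranges, start = [], 0
    for c in counts:
        ranges.append((start, start + c))
        start += c
    return ranges


if __name__ == "__main__":
    # Hypothetical relative speeds: two CPUs and two coprocessors.
    print(split_rows(1000, [1.0, 1.0, 1.5, 1.5]))
```

A static split like this only balances well if the relative throughputs are stable; a flexible scheme would re-measure them and adjust the ranges between time steps.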
In this work, we focus on a systematic adaptation of the stencil-based multidimensional positive definite advection transport algorithm (MPDATA) to different graphics processing unit (GPU)-based computing platforms. Another objective of this work is to compare the performance of MPDATA on several platforms, including a multi-GPU system with two NVIDIA Tesla K80 cards, and single-card platforms with Tesla K20X, GeForce GTX TITAN, and GeForce GTX 980. We propose the following optimization methods to improve the overall performance: (i) reducing the number of operations by subexpression elimination when implementing 2.5D blocking; (ii) reorganizing boundary conditions to reduce branch instructions; (iii) advanced memory management to increase coalesced memory access; and (iv) warp rearrangement to optimize data access to GPU global memory. The presented methods of adapting MPDATA to GPU architectures allow us to efficiently use many graphics processors within a single node by applying peer-to-peer data transfers between GPU global memories. We propose an auto-tuning procedure to compensate for architectural differences between the considered platforms. This procedure takes into account algorithm/GPU-specific parameters. The proposed approach to adapting MPDATA to GPU architectures allows us to achieve up to 482.5 Gflop/s on the platform equipped with two NVIDIA K80 GPUs.
[...] for simulating thermo-fluid flows across a wide range of scales and physical scenarios, such as numerical weather and climate prediction, simulation of urban flows, areas of turbulence, ocean currents, and others. Recently, the dynamical core of EULAG has been implemented in the Consortium for Small-scale Modeling (COSMO) weather prediction framework and is expected to be in operational use [5]. The dynamical core of EULAG is based on the non-hydrostatic Euler equations, either fully compressible or anelastic.
The model employs the generalized curvilinear coordinate description, the finite-volume non-oscillatory transport algorithm MPDATA, and the advanced elliptic solver generalized conjugate residual (GCR) [6]. To run existing codes efficiently on new hybrid platforms with accelerators, it is necessary to redesign the structures of these codes [7]. In our previous work [8], we proposed two decompositions of 2D MPDATA computations, which provide adaptation to CPU and GPU architectures. We developed a hybrid CPU-GPU version of 2D MPDATA in order to fully utilize all the available computing resources. The next step in our research was to parallelize the 3D version of MPDATA. This required a different approach from the one used for the 2D version. In papers [7,9], we presented an analysis of GPU resource usage and its influence on the resulting performance. We detected the bottlenecks and developed a method for the efficient distribution of computation across GPU kernels. Following our previous papers, in this work, we propose a set of methods for adapting the 3D MPDATA to different GPU accelerators. We investigate differ...
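The auto-tuning procedure mentioned in the GPU abstract is not specified in the text available here. As a hedged illustration only, the sketch below assumes the simplest possible strategy: an exhaustive search that times a kernel for every combination of tunable parameters and keeps the fastest. The names `autotune`, `block_x`, and `block_y` are hypothetical, and the "kernel" is a stand-in Python function rather than a GPU launch.

```python
import itertools
import time


def autotune(run, search_space, repeats=3, clock=time.perf_counter):
    """Time run(**config) for every combination in search_space.

    search_space maps a parameter name to a list of candidate values.
    Returns (best_config, best_time), where best_time is the minimum
    elapsed time observed over `repeats` runs of the best configuration.
    """
    names = sorted(search_space)
    best_config, best_time = None, float("inf")
    for values in itertools.product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        elapsed = float("inf")
        for _ in range(repeats):
            start = clock()
            run(**config)
            elapsed = min(elapsed, clock() - start)
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time


if __name__ == "__main__":
    def toy_kernel(block_x, block_y):
        # Stand-in workload whose cost depends on the chosen block shape.
        sum(range(1000 * (abs(block_x - 32) + abs(block_y - 8) + 1)))

    config, seconds = autotune(toy_kernel,
                               {"block_x": [16, 32, 64], "block_y": [4, 8, 16]})
    print(config, seconds)
```

Taking the minimum over several repeats reduces the effect of timing noise. For large parameter spaces an exhaustive search becomes expensive, and a real procedure would likely prune candidates using hardware limits.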