We extract pixel-level masks of extreme weather patterns using variants of the Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and a parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version of the DeepLabv3+ network achieves a peak and sustained throughput of 1.13 EF/s and 999.0 PF/s, respectively.
The AFiD code, an open-source solver for the incompressible Navier-Stokes equations (http://www.afid.eu), has been ported to GPU clusters to tackle large-scale wall-bounded turbulent flow simulations. The GPU porting has been carried out in CUDA Fortran with extensive use of kernel loop directives (CUF kernels) in order to keep the source code as close as possible to the original CPU version; only a few routines have been manually rewritten. A new transpose scheme, which is not limited to the GPU version and can be applied to any CFD code that uses pencil-distributed parallelization, has been devised to improve the scaling of the Poisson solver, the main bottleneck of incompressible solvers. The GPU version reduces the wall-clock time by an order of magnitude compared to the CPU version for large meshes. Due to the increased performance and efficient use of memory, the GPU version of AFiD can perform simulations in parameter ranges that are unprecedented in thermally driven wall-bounded turbulence. To verify the accuracy of the code, turbulent Rayleigh-Bénard convection and plane Couette flow are simulated, and the results are in good agreement with experimental and computational data published in the previous literature.
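To illustrate the kind of redistribution a pencil-decomposed transpose performs, here is a minimal serial sketch in Python/NumPy that emulates the all-to-all exchange from x-pencils to y-pencils on a 2D slab; the function name and list-of-blocks representation are ours for illustration and do not come from the AFiD source, where this is done with MPI.

```python
import numpy as np

def x_to_y_pencils(x_pencils):
    """Redistribute x-pencils to y-pencils (an all-to-all exchange, emulated serially).

    x_pencils[r] holds the full x extent and rank r's slice of y;
    the result y_pencils[r] holds rank r's slice of x and the full y extent.
    """
    P = len(x_pencils)                  # number of ranks
    nx, ny_loc = x_pencils[0].shape
    nx_loc = nx // P
    y_pencils = []
    for dest in range(P):
        # "Pack": each source rank cuts out the x block destined for rank `dest`.
        # "Unpack": rank `dest` concatenates the received blocks along y.
        blocks = [x_pencils[src][dest * nx_loc:(dest + 1) * nx_loc, :]
                  for src in range(P)]
        y_pencils.append(np.concatenate(blocks, axis=1))
    return y_pencils
```

In a real pencil-decomposed solver each list entry lives on a different MPI rank and the block exchange is an `MPI_Alltoall`; the pack/unpack ordering above is what determines how much local copying the transpose costs.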
PROGRAM SUMMARY
Program Title: AFiD-GPU
Licensing provisions: GPLv3
Programming language: Fortran 90, CUDA Fortran, MPI
External routines: PGI, CUDA Toolkit, FFTW3, HDF5
Nature of problem: Solving the three-dimensional Navier-Stokes equations coupled with a scalar field in a box bounded by two walls, with the other four boundaries periodic.
Solution method: Second-order finite-difference spatial discretization; third-order Runge-Kutta scheme and Crank-Nicolson method for time advancement; two-dimensional pencil-distributed MPI parallelization; GPU-accelerated routines.
Additional comments: The open-source code is supported and updated at http://www.afid.eu.
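The semi-implicit time advancement named in the solution method can be sketched for a scalar model problem du/dt = N(u) + λu, with the nonlinear term N advanced explicitly by a low-storage third-order Runge-Kutta scheme and the stiff linear term by Crank-Nicolson at each substep. This is a minimal sketch under standard assumptions (the commonly used Rai-Moin substep coefficients); the AFiD source may differ in detail, and for a scalar λ the implicit solve reduces to a division.

```python
def rk3_cn_step(u, dt, nonlinear, lam):
    """One full time step for du/dt = N(u) + lam*u.

    N(u) is treated explicitly with a three-substep low-storage RK3
    (Rai-Moin coefficients); lam*u is treated with Crank-Nicolson
    over each substep of size (gamma + rho)*dt.
    """
    gam = (8.0 / 15.0, 5.0 / 12.0, 3.0 / 4.0)
    rho = (0.0, -17.0 / 60.0, -5.0 / 12.0)
    n_old = 0.0
    for g, r in zip(gam, rho):
        a = g + r                       # fraction of dt covered by this substep
        n_new = nonlinear(u)
        # Explicit RK3 combination of N, plus the CN half of the linear term:
        rhs = u + dt * (g * n_new + r * n_old) + 0.5 * a * dt * lam * u
        u = rhs / (1.0 - 0.5 * a * dt * lam)   # implicit CN solve (scalar case)
        n_old = n_new
    return u
```

In the PDE setting the division becomes a tridiagonal solve in the wall-normal direction, which is why the Crank-Nicolson treatment is affordable while removing the viscous stability restriction.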
Researchers have recently used the new programmable capabilities of the Graphics Processing Unit (GPU) to increase the performance of scientific codes. We investigate the use of a cluster of GPUs for large-scale CFD problems and show order-of-magnitude increases in performance and performance-to-price ratio. We implement two separate compressible flow solvers. First, we develop a CUDA-based solver for the 2D compressible Euler equations and verify the results against a reference multi-block code, MBFLO. After demonstrating the performance of our Euler solver, we proceed to develop a new version of MBFLO by adding GPU-accelerated subroutines to the existing Fortran codebase. Using an eight-node cluster equipped with 16 NVIDIA 9800GX2 GPUs, we achieve speedups of up to 496x on our Euler solver and 88x on MBFLO. This paper describes the numerical, hardware, and software techniques that provide these significant speedups.
This work presents the GPU acceleration of the open-source code CaNS for very fast massively-parallel simulations of canonical fluid flows. The distinct feature of the many-CPU Navier-Stokes solver in CaNS is its fast direct solver for the second-order finite-difference Poisson equation, based on the method of eigenfunction expansions. The solver implements all the boundary conditions valid for this type of problem in a unified framework. Here, we extend the solver to GPU-accelerated clusters using CUDA Fortran. The porting makes extensive use of CUF kernels and has been greatly simplified by the unified memory feature of CUDA Fortran, which handles the data migration between host (CPU) and device (GPU) without defining new arrays in the source code. The overall implementation is open source under the terms of an MIT license.
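The eigenfunction-expansion idea behind this class of direct Poisson solvers can be shown in its simplest setting: in a periodic direction, the second-order central-difference Laplacian is diagonalized by Fourier modes with modified wavenumbers, so the solve reduces to a division per mode. The following Python/NumPy sketch solves the 1D periodic case; the function name is ours, and CaNS itself handles the general 3D problem with FFTs in two directions and a tridiagonal solve in the third.

```python
import numpy as np

def poisson_periodic_1d(f, L):
    """Solve u'' = f with periodic BCs on [0, L) via the eigenfunction
    (Fourier) expansion of the second-order central-difference Laplacian."""
    n = f.size
    dx = L / n
    fhat = np.fft.rfft(f)
    k = np.arange(fhat.size)
    # Exact eigenvalues of the discrete Laplacian ("modified wavenumbers"):
    lam = (2.0 * np.cos(2.0 * np.pi * k / n) - 2.0) / dx**2
    lam[0] = 1.0            # avoid division by zero for the mean mode
    uhat = fhat / lam
    uhat[0] = 0.0           # fix the solution's mean to zero
    return np.fft.irfft(uhat, n)
```

Using the modified wavenumbers (rather than the exact ones) makes the solver consistent with the finite-difference discretization to machine precision, which is what makes it a *direct* solver rather than an approximate spectral one.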