The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the Tensor Core, that performs one matrix-multiply-and-accumulate on 4×4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While the precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
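The WMMA path mentioned above can be illustrated with a minimal warp-level kernel. This is a sketch of the general API usage (one warp computing a single 16×16 tile with half-precision inputs and single-precision accumulation), not the benchmark code used in the paper:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C for a single 16x16x16 tile on the
// Tensor Cores; inputs are half precision, the accumulator is float.
__global__ void wmma_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);              // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);          // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // Tensor Core MMA
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

// Launched with a single warp, e.g.: wmma_tile<<<1, 32>>>(d_a, d_b, d_c);
```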
Over recent decades, missions such as Cluster, THEMIS, and the Magnetospheric Multiscale Mission (MMS) have provided the space physics community with an abundance of in situ measurements across the magnetosphere (MSP) and the solar wind. These regions contain internally distinct plasma and field characteristics, which correspond to important regions and boundaries (e.g., bow shock, magnetopause, foreshock) that are of high scientific interest. On many occasions, scientific investigations are centered explicitly on the physical processes operating at these regions and boundaries. Before such investigations can proceed, however, these regions must be identified in the data, traditionally by hand. The available measurements now encompass decades of observations, and manually surveying these data to choose regions of interest is labor-intensive and often ineffective. The combination of increasingly sophisticated machine learning techniques and readily available computational resources affords a means to classify and sort massive quantities of data. In this paper, we describe a machine learning methodology that can automatically identify separate plasma regions across the upstream solar wind and dayside MSP using MMS data. The principal objective of MMS (Burch et al., 2016) is to understand the physical processes and the fundamental sequence of events causing magnetic reconnection, since it is the central driver of space weather events at Earth and a fundamental plasma process across diverse plasma environments. However, MMS
CUDA Unified Memory improves GPU programmability and also enables GPU memory oversubscription. Recently, two advanced memory features, memory advises and asynchronous prefetch, have been introduced. In this work, we evaluate these new features on two platforms that feature different CPUs, GPUs, and interconnects. We derive a benchmark suite for the experiments and stress the memory system to evaluate both in-memory and oversubscription performance. The results show that memory advises on the Intel-Volta/Pascal-PCIe platform bring negligible improvement for in-memory executions. However, when GPU memory is oversubscribed by about 50%, using memory advises results in up to 25% performance improvement compared to basic CUDA Unified Memory. In contrast, the Power9-Volta-NVLink platform can substantially benefit from memory advises, achieving up to 34% performance gain for in-memory executions. However, when GPU memory is oversubscribed on this platform, using memory advises increases GPU page faults and results in considerable performance loss. The CUDA prefetch also shows a different performance impact on the two platforms: it improves performance by up to 50% on the Intel-Volta/Pascal-PCIe platform but brings little benefit to the Power9-Volta-NVLink platform.
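As a concrete illustration of the two features, the sketch below allocates managed memory, applies a read-mostly advise, and prefetches the data to the GPU. The working-set size and the commented-out kernel launch are placeholders, not the benchmark suite from this work:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = (1ull << 28) * sizeof(float); // hypothetical working set
    float *data;
    cudaMallocManaged(&data, bytes);

    int device;
    cudaGetDevice(&device);

    // Memory advise: hint that the pages will mostly be read on the GPU,
    // so the driver can replicate read-only copies instead of migrating.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);

    // Asynchronous prefetch: migrate the pages to the GPU ahead of use,
    // avoiding on-demand page faults during the kernel.
    cudaMemPrefetchAsync(data, bytes, device, 0);

    // kernel<<<grid, block>>>(data);  // placeholder for the actual workload
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```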
The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. In fact, during training, a considerably high number of relatively small files are first loaded and pre-processed on CPUs and then moved to the accelerator for computation. In addition, checkpoint and restart operations are carried out so that DL computing frameworks can restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow read bandwidth by a maximum of 2.3× and 7.8× on our two benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on overall performance. Using a burst buffer to checkpoint to fast, small-capacity storage and asynchronously copy the checkpoints to slower, large-capacity storage resulted in a performance improvement of 2.6× with respect to checkpointing directly to the slower storage on our benchmark environment.
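The burst-buffer idea can be sketched on the host side as follows. The paths and the write_checkpoint call are hypothetical stand-ins for the framework's checkpoint writer, not the implementation evaluated in the paper:

```cuda
// Plain host-side C++ (compilable with nvcc): write the checkpoint to the
// fast, small-capacity tier, then drain it to the slow, large-capacity tier
// in a background thread so training resumes immediately.
#include <filesystem>
#include <string>
#include <thread>

void write_checkpoint(const std::filesystem::path &dst); // assumed framework call

void checkpoint_with_burst_buffer(const std::string &name) {
    namespace fs = std::filesystem;
    const fs::path fast = fs::path("/nvme/burst") / name;       // burst buffer
    const fs::path slow = fs::path("/pfs/checkpoints") / name;  // parallel FS

    write_checkpoint(fast);  // blocking, but against fast local storage

    // Asynchronous copy to slow storage; training is not blocked on it.
    std::thread([fast, slow] {
        fs::copy_file(fast, slow, fs::copy_options::overwrite_existing);
    }).detach();
}
```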
Particle-in-Cell (PIC) methods are widely used computational tools for fluid and kinetic plasma modeling. While both the fluid and kinetic PIC approaches have been successfully used to target either kinetic or fluid simulations, little has been done to combine fluid and kinetic particles under the same PIC framework. This work addresses this issue by proposing a new PIC method, PolyPIC, that uses polymorphic computational particles. In this numerical scheme, particles can be either kinetic or fluid, and fluid particles can become kinetic when necessary, e.g., when particles undergo a strong acceleration. We design and implement the PolyPIC method and test it against the Landau damping of Langmuir and ion acoustic waves, the two-stream instability, and sheath formation. We unify the fluid and kinetic PIC methods under one common framework comprising both fluid and kinetic particles, providing a tool for adaptive fluid-kinetic coupling in plasma simulations.
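A polymorphic particle can be pictured as a record carrying its current representation. The following is a hypothetical sketch of the promotion step; the names, layout, and acceleration criterion are illustrative, not the authors' actual data structures:

```cuda
// Each particle carries a flag; a fluid particle is promoted to kinetic
// when the magnitude of its acceleration exceeds a threshold.
enum class Kind { Fluid, Kinetic };

struct Particle {
    float3 x, v;  // position and velocity
    Kind kind;    // current representation
};

__global__ void promote_particles(Particle *p, const float3 *accel,
                                  int n, float threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float3 a = accel[i];
    const float a2 = a.x * a.x + a.y * a.y + a.z * a.z;
    if (p[i].kind == Kind::Fluid && a2 > threshold * threshold)
        p[i].kind = Kind::Kinetic;  // strong acceleration: switch to kinetic
}
```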
SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission-funded project for the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with an associated software stack. The SAGE system follows a storage-centric approach, as it is capable of storing and processing large data volumes at the Exascale regime. SAGE addresses the convergence of Big Data Analysis and HPC in an era of next-generation data-centric computing. This convergence is driven by the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analyzed, and integrated into simulations to derive scientific and innovative insights. A first prototype of the SAGE system has been implemented and installed at the Jülich Supercomputing Centre. The SAGE storage system consists of multiple types of storage device technologies in a multi-tier I/O hierarchy, including flash, disk, and non-volatile memory technologies. The main SAGE software component is the Seagate Mero object storage, which is accessible via the Clovis API and higher-level interfaces. The SAGE project also includes scientific applications for the validation of the SAGE concepts. The objective of this paper is to present the SAGE project concepts and the prototype of the SAGE platform, and to discuss the software architecture of the SAGE system.
Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation to exploit such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle decomposition data layout, in contrast to the domain decomposition of CPU-based implementations, to use particle batches for overlapping communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speed-up on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide a performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800× performance improvement with respect to the CPU OpenMP version of sputniPIC. We show that reduced precision could further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, sputniPIC enables, on a single node with multiple GPUs, large-scale three-dimensional PIC simulations that were previously only possible using clusters.
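The overlap of communication and computation via particle batches can be sketched with CUDA streams. The particle record, the trivial mover, and the batch count below are illustrative stand-ins for the actual sputniPIC kernels:

```cuda
#include <cuda_runtime.h>

struct Particle { float x, y, z, u, v, w; };  // illustrative particle record

__global__ void mover_kernel(Particle *p, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // trivial position update standing in for the real mover
        p[i].x += p[i].u * dt; p[i].y += p[i].v * dt; p[i].z += p[i].w * dt;
    }
}

// Split the particles into batches on separate streams so the host-device
// copy of one batch overlaps with the mover kernel of another.
void move_in_batches(const Particle *h_p, Particle *d_p, int n, float dt) {
    const int kBatches = 4;          // illustrative batch count
    const int batch = n / kBatches;  // assume n divisible for brevity
    cudaStream_t streams[kBatches];
    for (int s = 0; s < kBatches; ++s) cudaStreamCreate(&streams[s]);

    for (int b = 0; b < kBatches; ++b) {
        Particle *dst = d_p + b * batch;
        // h_p must be pinned for the copy to be truly asynchronous.
        cudaMemcpyAsync(dst, h_p + b * batch, batch * sizeof(Particle),
                        cudaMemcpyHostToDevice, streams[b]);
        mover_kernel<<<(batch + 255) / 256, 256, 0, streams[b]>>>(dst, batch, dt);
    }
    for (int s = 0; s < kBatches; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```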