The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the Tensor Core, that performs one matrix-multiply-and-accumulate on 4×4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While the precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
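The WMMA path mentioned above can be illustrated with a minimal warp-level kernel. This is a sketch of the general API usage (one warp computing a single 16×16 tile with half-precision inputs and single-precision accumulation), not the benchmark code used in the paper:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C for a single 16x16x16 tile on the
// Tensor Cores; inputs are half precision, the accumulator is float.
__global__ void wmma_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);              // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);          // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // Tensor Core MMA
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

// Launched with a single warp, e.g.: wmma_tile<<<1, 32>>>(d_a, d_b, d_c);
```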
Over recent decades, missions such as Cluster, THEMIS, and the Magnetospheric Multiscale Mission (MMS) have provided the space physics community with an abundance of in situ measurements across the magnetosphere (MSP) and the solar wind. These regions contain internally distinct plasma and field characteristics, which correspond to important regions and boundaries (e.g., bow shock, magnetopause, foreshock) that are of high scientific interest. On many occasions, scientific investigations are centered explicitly on the physical processes operating at these regions and boundaries. Before such investigations can proceed, however, these regions must be identified in the data, traditionally by hand. The available measurements now encompass decades of observations, and manually surveying these data to choose regions of interest is labor-intensive and often ineffective. The combination of increasingly sophisticated machine learning techniques and readily available computational resources affords a means to classify and sort massive quantities of data. In this paper, we describe a machine learning methodology that can automatically identify separate plasma regions across the upstream solar wind and dayside MSP using MMS data. The principal objective of MMS (Burch et al., 2016) is to understand the physical processes and the fundamental sequence of events causing magnetic reconnection, since it is the central driver of space weather events at Earth and a fundamental plasma process across diverse plasma environments. However, MMS
CUDA Unified Memory improves GPU programmability and also enables GPU memory oversubscription. Recently, two advanced memory features, memory advises and asynchronous prefetch, have been introduced. In this work, we evaluate these new features on two platforms that feature different CPUs, GPUs, and interconnects. We derive a benchmark suite for the experiments and stress the memory system to evaluate both in-memory and oversubscription performance. The results show that memory advises on the Intel-Volta/Pascal-PCIe platform bring negligible improvement for in-memory executions. However, when GPU memory is oversubscribed by about 50%, using memory advises results in up to 25% performance improvement compared to basic CUDA Unified Memory. In contrast, the Power9-Volta-NVLink platform can substantially benefit from memory advises, achieving up to 34% performance gain for in-memory executions. However, when GPU memory is oversubscribed on this platform, using memory advises increases GPU page faults and results in considerable performance loss. The CUDA prefetch also shows a different performance impact on the two platforms: it improves performance by up to 50% on the Intel-Volta/Pascal-PCIe platform but brings little benefit to the Power9-Volta-NVLink platform.
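As a concrete illustration of the two features, the sketch below allocates managed memory, applies a read-mostly advise, and prefetches the data to the GPU. The working-set size and the commented-out kernel launch are placeholders, not the benchmark suite from this work:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = (1ull << 28) * sizeof(float); // hypothetical working set
    float *data;
    cudaMallocManaged(&data, bytes);

    int device;
    cudaGetDevice(&device);

    // Memory advise: hint that the pages will mostly be read on the GPU,
    // so the driver can replicate read-only copies instead of migrating.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);

    // Asynchronous prefetch: migrate the pages to the GPU ahead of use,
    // avoiding on-demand page faults during the kernel.
    cudaMemPrefetchAsync(data, bytes, device, 0);

    // kernel<<<grid, block>>>(data);  // placeholder for the actual workload
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```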
The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. In fact, during training, a considerably high number of relatively small files are first loaded and pre-processed on CPUs and then moved to the accelerator for computation. In addition, checkpoint and restart operations are carried out so that DL computing frameworks can restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow read bandwidth by a maximum of 2.3× and 7.8× on our two benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on overall performance. Using a burst buffer to checkpoint to fast, small-capacity storage and asynchronously copy the checkpoints to slower, large-capacity storage resulted in a performance improvement of 2.6× with respect to checkpointing directly to the slower storage on our benchmark environment.
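The burst-buffer idea can be sketched on the host side as follows. The paths and the write_checkpoint call are hypothetical stand-ins for the framework's checkpoint writer, not the implementation evaluated in the paper:

```cuda
// Plain host-side C++ (compilable with nvcc): write the checkpoint to the
// fast, small-capacity tier, then drain it to the slow, large-capacity tier
// in a background thread so training resumes immediately.
#include <filesystem>
#include <string>
#include <thread>

void write_checkpoint(const std::filesystem::path &dst); // assumed framework call

void checkpoint_with_burst_buffer(const std::string &name) {
    namespace fs = std::filesystem;
    const fs::path fast = fs::path("/nvme/burst") / name;       // burst buffer
    const fs::path slow = fs::path("/pfs/checkpoints") / name;  // parallel FS

    write_checkpoint(fast);  // blocking, but against fast local storage

    // Asynchronous copy to slow storage; training is not blocked on it.
    std::thread([fast, slow] {
        fs::copy_file(fast, slow, fs::copy_options::overwrite_existing);
    }).detach();
}
```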
Particle-in-Cell (PIC) methods are widely used computational tools for fluid and kinetic plasma modeling. While both the fluid and kinetic PIC approaches have been successfully used to target either kinetic or fluid simulations, little has been done to combine fluid and kinetic particles under the same PIC framework. This work addresses this issue by proposing a new PIC method, PolyPIC, that uses polymorphic computational particles. In this numerical scheme, particles can be either kinetic or fluid, and fluid particles can become kinetic when necessary, e.g., when particles undergo a strong acceleration. We design and implement the PolyPIC method and test it against the Landau damping of Langmuir and ion acoustic waves, the two-stream instability, and sheath formation. We unify the fluid and kinetic PIC methods under one common framework comprising both fluid and kinetic particles, providing a tool for adaptive fluid-kinetic coupling in plasma simulations.
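A polymorphic particle can be pictured as a record carrying its current representation. The following is a hypothetical sketch of the promotion step; the names, layout, and acceleration criterion are illustrative, not the authors' actual data structures:

```cuda
// Each particle carries a flag; a fluid particle is promoted to kinetic
// when the magnitude of its acceleration exceeds a threshold.
enum class Kind { Fluid, Kinetic };

struct Particle {
    float3 x, v;  // position and velocity
    Kind kind;    // current representation
};

__global__ void promote_particles(Particle *p, const float3 *accel,
                                  int n, float threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float3 a = accel[i];
    const float a2 = a.x * a.x + a.y * a.y + a.z * a.z;
    if (p[i].kind == Kind::Fluid && a2 > threshold * threshold)
        p[i].kind = Kind::Kinetic;  // strong acceleration: switch to kinetic
}
```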
SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission-funded project for the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with an associated software stack. The SAGE system follows a storage-centric approach, as it is capable of storing and processing large data volumes at the Exascale regime. SAGE addresses the convergence of Big Data Analysis and HPC in an era of next-generation data-centric computing. This convergence is driven by the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analyzed, and integrated into simulations to derive scientific and innovative insights. A first prototype of the SAGE system has been implemented and installed at the Jülich Supercomputing Centre. The SAGE storage system consists of multiple types of storage device technologies in a multi-tier I/O hierarchy, including flash, disk, and non-volatile memory technologies. The main SAGE software component is the Seagate Mero object storage, which is accessible via the Clovis API and higher-level interfaces. The SAGE project also includes scientific applications for the validation of the SAGE concepts. The objective of this paper is to present the SAGE project concepts and the prototype of the SAGE platform, and to discuss the software architecture of the SAGE system.
Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation to exploit such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle decomposition data layout, in contrast to the domain decomposition of CPU-based implementations, to use particle batches for overlapping communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speed-up on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide a performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800× performance improvement with respect to the CPU OpenMP version of sputniPIC. We show that reduced precision could further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, sputniPIC enables, on a single node with multiple GPUs, large-scale three-dimensional PIC simulations that were previously only possible using clusters.
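The overlap of communication and computation via particle batches can be sketched with CUDA streams. The particle record, the trivial mover, and the batch count below are illustrative stand-ins for the actual sputniPIC kernels:

```cuda
#include <cuda_runtime.h>

struct Particle { float x, y, z, u, v, w; };  // illustrative particle record

__global__ void mover_kernel(Particle *p, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // trivial position update standing in for the real mover
        p[i].x += p[i].u * dt; p[i].y += p[i].v * dt; p[i].z += p[i].w * dt;
    }
}

// Split the particles into batches on separate streams so the host-device
// copy of one batch overlaps with the mover kernel of another.
void move_in_batches(const Particle *h_p, Particle *d_p, int n, float dt) {
    const int kBatches = 4;          // illustrative batch count
    const int batch = n / kBatches;  // assume n divisible for brevity
    cudaStream_t streams[kBatches];
    for (int s = 0; s < kBatches; ++s) cudaStreamCreate(&streams[s]);

    for (int b = 0; b < kBatches; ++b) {
        Particle *dst = d_p + b * batch;
        // h_p must be pinned for the copy to be truly asynchronous.
        cudaMemcpyAsync(dst, h_p + b * batch, batch * sizeof(Particle),
                        cudaMemcpyHostToDevice, streams[b]);
        mover_kernel<<<(batch + 255) / 256, 256, 0, streams[b]>>>(dst, batch, dt);
    }
    for (int s = 0; s < kBatches; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```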