Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4 × 4 or 16 × 16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCUs to express both reduction and scan in terms of matrix multiplication and show the benefits in program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs, which would otherwise be idle, achieves 89%-98% of peak memory copy bandwidth, and is up to 100× faster for reduction and up to 3× faster for scan than state-of-the-art methods for small segment sizes, which are common in machine learning and scientific applications. Our algorithm achieves this while decreasing power consumption by up to 22% for reduction and 16% for scan.
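To make the core idea concrete, the following is a minimal sketch (not the authors' released implementation) of how a segmented reduction maps onto one Tensor Core matrix multiplication. Viewing 256 inputs as a 16 × 16 row-major matrix A, where each row is one 16-element segment, and multiplying by an all-ones matrix B yields D[i][j] = Σₖ A[i][k], so every column of row i holds the sum of segment i. The kernel name, launch configuration, and data layout below are illustrative assumptions; the code uses the standard CUDA WMMA API and requires compute capability 7.0 or higher (compile with nvcc -arch=sm_70).

```cuda
// Sketch: segmented reduction of sixteen 16-element segments via one
// 16x16x16 Tensor Core MMA. Illustrative only; not the paper's code.
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void segmented_reduce_16(const half *in, float *out) {
    // A single warp cooperatively performs the whole 16x16x16 MMA.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    // Build the all-ones B matrix in shared memory.
    __shared__ half ones[16 * 16];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        ones[i] = __float2half(1.0f);
    __syncthreads();

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, in, 16);    // 16 segments, one per row
    wmma::load_matrix_sync(b, ones, 16);  // all-ones matrix
    wmma::mma_sync(c, a, b, c);           // D[i][*] = sum of segment i

    __shared__ float d[16 * 16];
    wmma::store_matrix_sync(d, c, 16, wmma::mem_row_major);
    __syncthreads();
    if (threadIdx.x < 16)
        out[threadIdx.x] = d[threadIdx.x * 16];  // column 0: the 16 sums
}

int main() {
    half *in; float *out;
    cudaMallocManaged(&in, 256 * sizeof(half));
    cudaMallocManaged(&out, 16 * sizeof(float));
    for (int i = 0; i < 256; ++i)
        in[i] = __float2half(1.0f);  // every segment sums to 16
    segmented_reduce_16<<<1, 32>>>(in, out);  // one warp
    cudaDeviceSynchronize();
    for (int i = 0; i < 16; ++i)
        printf("segment %2d sum = %.0f\n", i, out[i]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

The same structure extends to scan: replacing the all-ones B with an upper-triangular ones matrix (B[k][j] = 1 for k ≤ j) makes the multiplication compute D[i][j] = Σₖ≤ⱼ A[i][k], i.e., an inclusive prefix sum of each 16-element segment.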