2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2018
DOI: 10.1109/ipdpsw.2018.00091

NVIDIA Tensor Core Programmability, Performance & Precision

Abstract: The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called Tensor Core, that performs one matrix-multiply-and-accumulate on 4×4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, N…
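As a rough illustration of the operation the abstract describes, the following sketch emulates a 4×4 matrix-multiply-and-accumulate in mixed precision (half-precision inputs, single-precision accumulation) using only the Python standard library, and compares it with a double-precision reference to expose the precision loss. It is an emulation under stated assumptions, not the hardware's actual rounding behavior or the paper's code: the helper names (`to_half`, `to_single`, `mma_4x4_mixed`, `mma_4x4_reference`, `max_abs_error`) are hypothetical, and the 1.53 GHz boost clock used in the peak-throughput estimate does not appear in the text above.

```python
import struct

def to_half(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value (overflows above 65504)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_single(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mma_4x4_mixed(a, b, c):
    """Emulate D = A*B + C with A and B rounded to half and accumulation in single.

    Assumption: sequential accumulation with rounding after every add;
    real Tensor Cores may order and round the sum differently.
    """
    d = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = to_single(c[i][j])
            for k in range(4):
                # A product of two half values is exactly representable in single.
                prod = to_half(a[i][k]) * to_half(b[k][j])
                acc = to_single(acc + prod)
            d[i][j] = acc
    return d

def mma_4x4_reference(a, b, c):
    """Same operation in Python's native double precision."""
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(4))
             for j in range(4)] for i in range(4)]

def max_abs_error(x, y):
    """Largest element-wise absolute difference between two 4x4 matrices."""
    return max(abs(x[i][j] - y[i][j]) for i in range(4) for j in range(4))

# Peak-throughput estimate from the figures in the abstract.
TENSOR_CORES = 640               # from the abstract
FMA_PER_CORE_PER_CLOCK = 4 * 4 * 4  # one 4x4 by 4x4 product needs 64 multiply-adds
FLOPS_PER_FMA = 2                # one multiply plus one add
BOOST_CLOCK_HZ = 1.53e9          # assumption: not stated in the text above
peak_tflops = TENSOR_CORES * FMA_PER_CORE_PER_CLOCK * FLOPS_PER_FMA * BOOST_CLOCK_HZ / 1e12

if __name__ == "__main__":
    a = [[0.1] * 4 for _ in range(4)]
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    zeros = [[0.0] * 4 for _ in range(4)]
    mixed = mma_4x4_mixed(a, identity, zeros)
    exact = mma_4x4_reference(a, identity, zeros)
    print("max abs error:", max_abs_error(mixed, exact))
    print("estimated peak (Tflops/s):", round(peak_tflops, 1))
```

Under the assumed clock, the estimate lands close to the 125 Tflops/s quoted in the abstract; the error printed for the 0.1-valued input comes entirely from rounding the inputs to half precision.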

Cited by 277 publications (173 citation statements)
References: 19 publications
“…In [53], the authors use microbenchmarks to discern microarchitectural details of the V100 architecture. In [45,59] use half precision and Tensor Cores to implement iterative solvers. They use half precision along with low quality solvers to compute the initial conditions and then switch to both higher precision solvers for subsequent iterations.…”
Section: Related Work (mentioning)
confidence: 99%
“…We also provide a methodology for uncovering the information presented (including describing our microbenchmarks). Markidis et al [47] studied the impact of precision loss and programmability aspect of Tensor Cores for HPC application. Khairy, et al [48] studied the memory system of modern GPUs including Volta and discovered many important design decisions in the memory system.…”
Section: Related Work (mentioning)
confidence: 99%
“…Quantization techniques [4,6,13,28] use integer or mixed precision arithmetic only available on state-of-the-art GPUs [38]. These methods reduce the computation time and the amount of storage required for the network parameters.…”
Section: Related Work (mentioning)
confidence: 99%