2022
DOI: 10.1073/pnas.2122762119
Large-scale distributed linear algebra with tensor processing units

Abstract: We have repurposed Google tensor processing units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs’ fast intercore interconnects (ICIs), physically two-dimensional network topology, and high-bandwidth memory (HBM) permit distributed matrix multiplication algorithms to rapidly become computationally bound. In this regime, the matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size: …
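As a rough illustration of the kind of computation the abstract describes, the sketch below shards a dense matmul across TPU cores with jax.pmap: each core holds one row block of A and a replica of B in its own HBM, and the local block product runs on the MXUs. This is a minimal sketch, not the authors' code; the sharding layout, problem sizes, and names are assumptions.

```python
# Minimal sketch (not the paper's implementation): row-sharded matmul with jax.pmap.
import jax
import jax.numpy as jnp
import numpy as np

n_dev = jax.local_device_count()            # e.g., 8 cores on a TPU v3-8
m, k, n = 512 * n_dev, 1024, 512            # illustrative global problem size

A = np.random.randn(m, k).astype(np.float32)
B = np.random.randn(k, n).astype(np.float32)

A_shards = A.reshape(n_dev, m // n_dev, k)  # one contiguous row block per core
B_repl = np.broadcast_to(B, (n_dev, k, n))  # B replicated on every core

@jax.pmap
def local_matmul(a_block, b):
    # Each core computes its own block of rows of C on the MXUs.
    return a_block @ b

C_shards = local_matmul(A_shards, B_repl)   # shape (n_dev, m // n_dev, n)
C = np.asarray(C_shards).reshape(m, n)      # gather only to verify the result
np.testing.assert_allclose(C, A @ B, rtol=1e-4, atol=1e-4)
```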

Cited by 10 publications (14 citation statements)
References 20 publications
“…The TPU hardware architecture is especially suited for dense large-scale matmuls, which we perform in distributed form using the SUMMA algorithm, as recently demonstrated in ref . Here, it was shown that, for sufficiently large matrices, a v3-512 TPU can perform dense matmuls at near-optimal efficiency: the performance per TPU core (measured in single-precision FLOPS) is maintained at roughly 93% of the single TPU core maximum performance. It is important to emphasize that TPUs are often ill-suited for other tasks, and hence the algorithms utilized in this work and those in ref had to be picked carefully and may differ from more conventional choices used in CPUs or GPUs.…”
Section: Results (mentioning)
confidence: 59%
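To make the SUMMA scheme mentioned in the statement above concrete, here is a small single-process Python emulation (not the authors' TPU implementation): tiles of A are "broadcast" along the rows of a logical p x p grid, tiles of B along its columns, and each grid cell accumulates its local block of C with a dense block matmul. The grid size, tile shapes, and names are illustrative assumptions.

```python
# Single-process emulation of SUMMA on a p x p logical processor grid.
import numpy as np

def summa(A_blocks, B_blocks, p):
    # A_blocks[i][k] and B_blocks[k][j] are the tiles owned by the grid cells.
    C_blocks = [[np.zeros((A_blocks[i][0].shape[0], B_blocks[0][j].shape[1]))
                 for j in range(p)] for i in range(p)]
    for k in range(p):                             # one broadcast step per panel
        for i in range(p):
            a_panel = A_blocks[i][k]               # broadcast along grid row i
            for j in range(p):
                b_panel = B_blocks[k][j]           # broadcast along grid column j
                C_blocks[i][j] += a_panel @ b_panel  # local block matmul
    return C_blocks

# Tiny correctness check against a plain matmul.
p, b = 2, 3
A = np.random.randn(p * b, p * b)
B = np.random.randn(p * b, p * b)
A_t = [[A[i*b:(i+1)*b, k*b:(k+1)*b] for k in range(p)] for i in range(p)]
B_t = [[B[k*b:(k+1)*b, j*b:(j+1)*b] for j in range(p)] for k in range(p)]
C = np.block(summa(A_t, B_t, p))
np.testing.assert_allclose(C, A @ B, atol=1e-10)
```

On a real TPU grid the row and column broadcasts would travel over the ICI links, and the local block matmuls would run on the MXUs of each core.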
“…Google’s Tensor Processing Units (TPUs) are application-specific integrated circuits originally designed to accelerate large-scale machine learning workloads. By leveraging the JAX library, it is nevertheless possible to repurpose TPUs for other computational tasks. In this work, we demonstrate the use of TPUs as quantum chemistry supercomputers by accelerating the O(N³) computational bottleneck of DFT approaches which use an auxiliary single-particle kinetic energy approximation, such as Kohn–Sham (KS) and generalized KS (gKS) DFT, where gKS admits hybrid DFT functionals. This enables the systematic study of quantum chemistry problems at unprecedented scales.…”
Section: Introduction (mentioning)
confidence: 99%
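As one hedged example of the O(N³) dense linear algebra that dominates KS/gKS DFT, the sketch below forms a density-matrix-like product P = C_occ C_occ^T from occupied orbital coefficients with JAX. It is illustrative only; the function name, array shapes, and the omission of occupation factors are assumptions, not the cited paper's API.

```python
# Hedged sketch of one representative O(N^3) step in KS/gKS DFT.
import jax
import jax.numpy as jnp

@jax.jit
def density_matrix(c_occ):
    # c_occ: (n_basis, n_occ) occupied molecular-orbital coefficients.
    # The contraction scales as O(n_basis^2 * n_occ) and is a dense matmul,
    # so it maps directly onto the TPU MXUs when compiled for TPU.
    # (Closed-shell occupation factors are omitted for simplicity.)
    return c_occ @ c_occ.T

key = jax.random.PRNGKey(0)
c_occ = jax.random.normal(key, (4096, 512), dtype=jnp.float32)
P = density_matrix(c_occ)   # (4096, 4096) density-matrix-like array
```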
“…Tensor cores, a new type of accelerated hardware developed for machine learning, are ideally suited for computations using this network structure. Our results add to the nascent, yet growing, body of work that utilizes specialized machine learning hardware for non-AI scientific purposes, e.g., see refs , and . With this article, we have further broadened the applicability of Tensor cores to quantum response calculations, which represents yet another example of a more general science application.…”
Section: Discussion (mentioning)
confidence: 67%
“…The development of accelerated quantum response calculations presented in this article continues a currently ongoing theme of using Tensor cores (or other similar machine learning inspired hardware) for more general scientific applications. This work is similar in spirit to the transition from CPUs to GPUs for scientific computations that started over a decade ago and, in some sense, represents the next phase of a Darwinian-like computational evolution driven by new hardware environments. Currently, seven of the ten most powerful computers in the world utilize chips built with Tensor core accelerators, and GPU-accelerated architectures are common among the top 500 supercomputers.…”
Section: Introduction (mentioning)
confidence: 84%
“…A team from Citadel Enterprise America also reported on a series of HPC microbenchmarks that they ran on GraphCore IPUs [134]. Google Research has been very busy demonstrating their TPUs on a variety of parallel HPC applications including flood prediction [135], large scale distributed linear algebra [136], molecular dynamics simulation [137], fast Fourier transforms [138], [139], MRI reconstruction [140], financial Monte Carlo simulations [141], and Monte Carlo simulation of the Ising model [142]. We see this as a foreshadowing of more interesting research and development in using these high-performance accelerators.…”
Section: A. Broader Trends (mentioning)
confidence: 99%