2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
DOI: 10.1109/isca52012.2021.00010
Ten Lessons From Three Generations Shaped Google’s TPUv4i: Industrial Product

Cited by 187 publications (104 citation statements)
References 20 publications
“…In the case of Layerweaver, while most of the BERT-large (NLP) requests satisfy the QoS constraints, over 90% of the MobileNetV2 (vision) requests violate them. Due to the growing importance of support for multi-tenancy on NPUs, the lack of QoS is a serious drawback in datacenters [3].…”
Section: Limitations of the Prior Art
confidence: 99%
“…For example, Google TPUv3 [2], which targets both DNN training and inference, features 128 TOP/s of computation and 900 GB/s off-chip memory bandwidth. In contrast, TPUv4i [3] targets DNN inference only, and its compute-to-memory bandwidth ratio is substantially higher. On the other hand, DNN models in service have very different arithmetic intensities depending on their layer structures, operators, etc. Thus, there is no one-size-fits-all accelerator that works well for all of those DNN models.…”
Section: Introduction
confidence: 99%
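The compute-to-memory-bandwidth contrast in the quote above is usually framed with the roofline model. A minimal sketch using the quoted TPUv3 figures (128 TOP/s peak compute, 900 GB/s off-chip bandwidth); the function names and the 50 ops/byte kernel are illustrative assumptions, not from the cited papers.

```python
def ridge_point(peak_ops_per_s: float, mem_bw_bytes_per_s: float) -> float:
    """Arithmetic intensity (ops/byte) at which a kernel shifts from
    memory-bound to compute-bound in the roofline model."""
    return peak_ops_per_s / mem_bw_bytes_per_s

def attainable_ops(intensity: float, peak_ops_per_s: float,
                   mem_bw_bytes_per_s: float) -> float:
    """Attainable throughput: the lesser of peak compute and
    bandwidth times arithmetic intensity."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * intensity)

TPU_V3_PEAK = 128e12  # 128 TOP/s (quoted figure)
TPU_V3_BW = 900e9     # 900 GB/s  (quoted figure)

# Ridge point ~142 ops/byte: kernels below it are memory-bound.
ridge = ridge_point(TPU_V3_PEAK, TPU_V3_BW)

# A hypothetical kernel at 50 ops/byte is memory-bound on this machine:
print(attainable_ops(50, TPU_V3_PEAK, TPU_V3_BW))  # 4.5e13 ops/s
```

This is why a higher compute-to-bandwidth ratio (as in TPUv4i) only pays off for models whose arithmetic intensity sits above the ridge point.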
“…As opposed to SysAr, TPU v3 [29] changes the VU structure to accelerate less arithmetically intensive operations such as the inverse square root of BN while training, albeit without elaborating on details about processing DW-CONV in the modified VU. In addition, as TPU v4 [28] reuses hardware designs of TPU v3 except for several components such as on-chip memory capacity, on-chip interconnect, and DMA, the VU of TPU v4 has the same structure as that of TPU v3. There have been processing-near-DRAM studies [10,14,31] to provide high off-chip memory bandwidth during inference.…”
Section: Related Work
confidence: 99%
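To make the "inverse square root of BN" remark concrete: batch normalization divides by the standard deviation, which vector units typically compute as a multiply by a reciprocal square root. A minimal NumPy sketch of that operation, purely illustrative and not TPU code:

```python
import numpy as np

def batch_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each feature over the batch: (x - mean) * rsqrt(var + eps).
    The rsqrt is the low-arithmetic-intensity op the quote refers to."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    inv_std = 1.0 / np.sqrt(var + eps)  # the inverse-square-root step
    return (x - mean) * inv_std

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = batch_norm(x)  # each column now has mean ~0 and std ~1
```

Because this is a handful of element-wise ops per byte moved, it is memory-bandwidth-bound, which is why the quote singles it out as a target for VU changes rather than the matrix unit.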
“…A spatial compute array is the key component in many popular low-cost CNN accelerators [50, 58, 97, 113–123].…”
Section: Spatial Architectures for CNN Inference
confidence: 99%
“…By orchestrating data into and out of the PE network, spatial architectures can efficiently implement either matrix multiplications or convolutions. Examples of spatial architectures include Eyeriss V1/V2 [50, 113], Google’s TPU [97, 117], NVIDIA’s CUDA Tensor Cores [124], Nanofabrics [125], TRIPS [126], RAW [127], SmartMemories [128], FlexFlow [114–116], SCNN [129], and Morph [130]. Figure 2 illustrates the core elements of a common spatial architecture for CNN inference.…”
Section: Spatial Architectures for CNN Inference
confidence: 99%
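The matrix-multiply role of a spatial PE array described above can be sketched in a few lines. This is a generic output-stationary model, in which each (i, j) grid position stands for one PE holding its output accumulator while operands stream through over time steps; it is an illustrative abstraction, not the dataflow of any specific accelerator cited.

```python
def spatial_matmul(A: list[list[int]], B: list[list[int]]) -> list[list[int]]:
    """Model an output-stationary spatial array computing C = A @ B.
    C[i][j] is the accumulator inside PE (i, j); the outer loop over t
    models streaming the shared (reduction) dimension through the array."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]          # one accumulator per PE
    for t in range(k):                        # time step: broadcast A[:,t], B[t,:]
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][t] * B[t][j]  # one MAC in PE (i, j)
    return C

print(spatial_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

In hardware the i and j loops run fully in parallel (one MAC per PE per cycle), so an n-by-m array finishes the product in k steps instead of n*m*k.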