2017
DOI: 10.1145/3140659.3080246

In-Datacenter Performance Analysis of a Tensor Processing Unit

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better…
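As a quick sanity check on the quoted 92 TOPS figure: with 65,536 MACs, each counted as two operations (a multiply and an add) per cycle, and the 700 MHz clock rate reported in the paper, the peak-throughput arithmetic works out as:

```python
macs = 65_536        # 256 x 256 array of 8-bit MAC units
ops_per_mac = 2      # one multiply + one add per MAC per cycle
clock_hz = 700e6     # TPU clock rate (700 MHz, per the paper)

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(round(peak_tops, 2))  # -> 91.75, quoted in the abstract as ~92 TOPS
```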


Cited by 1,113 publications (1,218 citation statements)
References 41 publications (37 reference statements)
“…In addition, 28 MB of software-managed on-chip memory is included to store the intermediate results and the inputs of the Matrix Multiply Unit. The datapath occupies 67% of the TPU floorplan, while the area occupied for control is only 2% [17]. This contrasts with state-of-the-art server CPUs and GPUs, in which the control structures occupy significant chip area and lead to increased power consumption.…”
Section: Google's Tensor Processing Unit (mentioning)
confidence: 99%
“…Based on a projection that voice-based search would significantly increase the computational demands of Google's datacenters, a custom ASIC chip, called the Tensor Processing Unit (TPU), was designed and deployed by Google in 2015 [17]. The TPU is aimed at accelerating the inference phase of different types of neural network applications, including multi-layer perceptrons (MLP), convolutional neural networks (CNN), and recurrent neural networks (RNN) [18].…”
Section: Google's Tensor Processing Unit (mentioning)
confidence: 99%
“…Mesh topology is a strikingly popular way to organize PEs; for example, Google's TPU [12], the DianNao family [15], and MIT's Eyeriss [16] (see Fig. 6).…”
Section: I. Spatial Dataflow Architecture (mentioning)
confidence: 99%
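The mesh of PEs this citation refers to is typically operated as a systolic array. A minimal NumPy sketch of an output-stationary systolic matrix multiply, as a functional simulation only (the skew term models wavefront timing, not the actual TPU microarchitecture, whose matrix unit is weight-stationary):

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A (m x k) @ B (k x n).

    PE (i, j) accumulates one output element. Operands are skewed by (i + j)
    so that A's rows flow rightward and B's columns flow downward through the
    mesh, one hop per cycle, and each PE performs one MAC per cycle.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((m, n), dtype=A.dtype)
    # Total cycles for the last operand to reach the far-corner PE:
    for t in range(m + n + k - 2):
        for i in range(m):
            for j in range(n):
                s = t - i - j            # which partial product arrives this cycle
                if 0 <= s < k:
                    acc[i, j] += A[i, s] * B[s, j]  # one MAC per PE per cycle
    return acc
```

The skew `t - i - j` is what makes the computation "systolic": data reaches PE (i, j) only after propagating i + j hops through the mesh, so no PE ever needs a global broadcast.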
“…To overcome this, research and development of special-purpose processors for convolution processing is under way [4]. Among these, Google's TPU (Tensor Processing Unit) reported results showing improved computation speed by performing only specific functions [5][6]. Accordingly, this paper proposes an ALU that can rapidly compute the multiplications and additions in convolution and pooling operations and supports parallel processing.…”
Section: Introduction: In the field of machine learning, the CNN algorithm boasts a high recognition rate in image recognition and classification (unclassified)
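The multiply-add pattern of convolution and pooling that the cited ALU targets can be sketched as follows. This is a plain NumPy illustration of the two operations, not the authors' hardware design:

```python
import numpy as np

def conv2d_valid(x, w):
    """2-D 'valid' convolution (cross-correlation, as used in CNNs).

    Every output pixel is a window of element-wise multiplies reduced by
    addition, i.e. exactly the MAC workload a convolution ALU accelerates.
    """
    H, W = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def max_pool2(x):
    """2x2 max pooling with stride 2 (odd trailing rows/columns dropped)."""
    H, W = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))
```

Both functions reduce a window to a single value; the convolution's window is fully parallel across output pixels, which is why the citation emphasizes a parallelizable multiply-add unit.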