2019 IEEE 26th Symposium on Computer Arithmetic (ARITH)
DOI: 10.1109/arith.2019.00019

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Abstract: In recent years, fused multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to their extreme computational intensity. Compared to classical IEEE-754 32-bit (FP32) and 64-bit (FP64) arithmetic, these reduced-precision arithmetic units can naturally be sped up disproportionately to their shortened width. The common strategy of all major hardware vendors i…
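As a rough sketch of the mixed-precision FMA scheme the abstract describes (lower-precision multiplications feeding a higher-precision accumulator), the following NumPy snippet emulates bfloat16 operands by truncating the low 16 bits of the float32 encoding and accumulates their products in FP32. The helper names `to_bf16_trunc` and `bf16_dot_fp32_acc` are illustrative, not taken from the paper, and real hardware may round to nearest rather than truncate.

```python
import numpy as np

def to_bf16_trunc(x):
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding
    (bfloat16 keeps float32's sign and 8 exponent bits, but only 7 explicit mantissa bits)."""
    x = np.asarray(x, dtype=np.float32)
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def bf16_dot_fp32_acc(a, b):
    """Dot product in the style of a mixed-precision FMA unit:
    bfloat16 inputs, with products and the running sum kept in float32."""
    a16, b16 = to_bf16_trunc(a), to_bf16_trunc(b)
    acc = np.float32(0.0)
    for x, y in zip(a16, b16):
        acc = np.float32(acc + x * y)   # FP32 accumulation of each bfloat16 product
    return acc

a = np.linspace(0.1, 1.0, 8, dtype=np.float32)
b = np.linspace(1.0, 2.0, 8, dtype=np.float32)
print(bf16_dot_fp32_acc(a, b), np.dot(a, b))   # close, but not bitwise equal to pure FP32
```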

Cited by 49 publications (65 citation statements)
References 7 publications

Citation statements:
“…By truncating the less significant fractional bits, converting an FP32 value to bfloat16 generates a small negative error from 0% to -0.78% relative to the original FP32 value. The factors discussed in Section 3 also minimize the negative effects of this varying error, and they explain why using the full FP32 accumulator after bfloat16 multiplication produces the best results [24], in agreement with the observation that the accumulations need to be exact. The accumulation of mean error discussed in Section 4 should also be present, but the mean error of bfloat16 is too small to cause any problems for the studied CNNs.…”
Section: Arithmetic Reason For Bfloat16 Success (supporting)
confidence: 70%
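The "0% to -0.78%" range quoted above matches the relative error of truncating the 16 low-order significand bits: for positive values the error is never positive and never below -2^-7 ≈ -0.0078. A quick NumPy check of that bound (the helper name `to_bf16_trunc` is illustrative):

```python
import numpy as np

def to_bf16_trunc(x):
    """Truncate float32 values to bfloat16 precision by zeroing the low 16 encoding bits."""
    x = np.asarray(x, dtype=np.float32)
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
x = rng.uniform(1e-3, 1e3, 1_000_000).astype(np.float32)   # positive FP32 samples
rel_err = (to_bf16_trunc(x) - x) / x
# Truncation only removes magnitude, so for positive inputs the relative error
# lies in (-2**-7, 0], i.e. between about -0.78% and 0%.
print(rel_err.max(), rel_err.min(), -2.0**-7)
```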
“…The inputs and outputs of MXU are float32 and the MAC operations on MXU are performed with bfloat16 [38]. However, one float32 number can be decomposed into multiple bfloat16 numbers and with appropriate accumulations, high-precision MAC operation can be achieved [39]. The implementation of both parallel algorithms in this work leverages the strategy of decomposition and accumulation and achieves the precision of float32.…”
Section: A Hardware Architecture (mentioning)
confidence: 99%
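A minimal sketch of the decomposition-and-accumulation idea cited here: a float32 value can be split into three bfloat16 terms (3 × 8 significand bits cover float32's 24), and the bfloat16-by-bfloat16 partial products, each exact in float32, are summed in an FP32 accumulator. The splitting and the choice of which cross terms to keep below are illustrative assumptions; the paper's actual hardware scheme may order or drop partial products differently.

```python
import numpy as np

def to_bf16(x):
    """Emulate bfloat16 by truncating the low 16 bits of the float32 encoding."""
    x = np.asarray(x, dtype=np.float32)
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def split3(x):
    """Split float32 values into three bfloat16 terms whose sum restores them
    (exactly for normal float32 values)."""
    x = np.asarray(x, dtype=np.float32)
    hi = to_bf16(x)
    mid = to_bf16(x - hi)
    lo = to_bf16(x - hi - mid)
    return hi, mid, lo

def dot_via_bf16(a, b):
    """Dot product built only from bfloat16 multiplications with FP32 accumulation.
    Each bfloat16 product (8-bit by 8-bit significands) is exact in float32."""
    ah, am, al = split3(a)
    bh, bm, bl = split3(b)
    acc = np.float32(0.0)
    # Partial products, most significant first; the smallest cross terms are dropped.
    for pa, pb in ((ah, bh), (ah, bm), (am, bh), (ah, bl), (al, bh), (am, bm)):
        acc = np.float32(acc + np.float32(np.dot(pa, pb)))
    return acc

rng = np.random.default_rng(1)
a = rng.standard_normal(256).astype(np.float32)
b = rng.standard_normal(256).astype(np.float32)
print(dot_via_bf16(a, b))                                               # near-FP32 accuracy
print(np.float32(np.dot(a.astype(np.float64), b.astype(np.float64))))   # FP64 reference
```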
“…Matrix calculations based on floating point numbers and bfloats have been investigated by Intel engineers using GEMM (General matrix multiply) algorithms as the basis for exploring prospects of low precision computations in comparison with standard data types using more than 16 bits [15]. The authors claim that systolic arrays based on bfloat16 and float16 provide many times faster computations than standard 32-bit floats.…”
Section: Bfloat (mentioning)
confidence: 99%