2019 IEEE 26th Symposium on Computer Arithmetic (ARITH)
DOI: 10.1109/arith.2019.00019

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Abstract: In recent years, fused multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to their extreme computational intensity. Compared to classical IEEE-754 32-bit (FP32) and 64-bit (FP64) arithmetic, these reduced-precision arithmetic units can naturally be sped up disproportionately to their shortened width. The common strategy of all major hardware vendors i…
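As a rough sketch of the mixed-precision FMA scheme the abstract describes (lower-precision multiplications feeding a higher-precision accumulator), the following NumPy snippet emulates bfloat16 operands by truncating the low 16 bits of the float32 encoding and accumulates their products in FP32. The helper names `to_bf16_trunc` and `bf16_dot_fp32_acc` are illustrative, not taken from the paper, and real hardware may round to nearest rather than truncate.

```python
import numpy as np

def to_bf16_trunc(x):
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding
    (bfloat16 keeps float32's sign and 8 exponent bits, but only 7 explicit mantissa bits)."""
    x = np.asarray(x, dtype=np.float32)
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def bf16_dot_fp32_acc(a, b):
    """Dot product in the style of a mixed-precision FMA unit:
    bfloat16 inputs, with products and the running sum kept in float32."""
    a16, b16 = to_bf16_trunc(a), to_bf16_trunc(b)
    acc = np.float32(0.0)
    for x, y in zip(a16, b16):
        acc = np.float32(acc + x * y)   # FP32 accumulation of each bfloat16 product
    return acc

a = np.linspace(0.1, 1.0, 8, dtype=np.float32)
b = np.linspace(1.0, 2.0, 8, dtype=np.float32)
print(bf16_dot_fp32_acc(a, b), np.dot(a, b))   # close, but not bitwise equal to pure FP32
```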

Cited by 49 publications (65 citation statements)
References 7 publications

Citation statements:
“…By truncating the less significant fractional bits, converting an FP32 value to bfloat16 generates a small negative error from 0% to -0.78% relative to the original FP32 value. The factors discussed in Section 3 also minimize the negative effects of this varying error, and they explain why using the full FP32 accumulator after bfloat16 multiplication produces the best results [24], in agreement with the observation that the accumulations need to be exact. The accumulation of mean error discussed in Section 4 should also be present, but the mean error of bfloat16 is too small to cause any problems for the studied CNNs.…”
Section: Arithmetic Reason For Bfloat16 Success (supporting)
confidence: 70%
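The "0% to -0.78%" range quoted above matches the relative error of truncating the 16 low-order significand bits: for positive values the error is never positive and never below -2^-7 ≈ -0.0078. A quick NumPy check of that bound (the helper name `to_bf16_trunc` is illustrative):

```python
import numpy as np

def to_bf16_trunc(x):
    """Truncate float32 values to bfloat16 precision by zeroing the low 16 encoding bits."""
    x = np.asarray(x, dtype=np.float32)
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
x = rng.uniform(1e-3, 1e3, 1_000_000).astype(np.float32)   # positive FP32 samples
rel_err = (to_bf16_trunc(x) - x) / x
# Truncation only removes magnitude, so for positive inputs the relative error
# lies in (-2**-7, 0], i.e. between about -0.78% and 0%.
print(rel_err.max(), rel_err.min(), -2.0**-7)
```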
“…The inputs and outputs of MXU are float32 and the MAC operations on MXU are performed with bfloat16 [38]. However, one float32 number can be decomposed into multiple bfloat16 numbers and with appropriate accumulations, high-precision MAC operation can be achieved [39]. The implementation of both parallel algorithms in this work leverages the strategy of decomposition and accumulation and achieves the precision of float32.…”
Section: A Hardware Architecture (mentioning)
confidence: 99%
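A minimal sketch of the decomposition-and-accumulation idea cited here: a float32 value can be split into three bfloat16 terms (3 × 8 significand bits cover float32's 24), and the bfloat16-by-bfloat16 partial products, each exact in float32, are summed in an FP32 accumulator. The splitting and the choice of which cross terms to keep below are illustrative assumptions; the paper's actual hardware scheme may order or drop partial products differently.

```python
import numpy as np

def to_bf16(x):
    """Emulate bfloat16 by truncating the low 16 bits of the float32 encoding."""
    x = np.asarray(x, dtype=np.float32)
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def split3(x):
    """Split float32 values into three bfloat16 terms whose sum restores them
    (exactly for normal float32 values)."""
    x = np.asarray(x, dtype=np.float32)
    hi = to_bf16(x)
    mid = to_bf16(x - hi)
    lo = to_bf16(x - hi - mid)
    return hi, mid, lo

def dot_via_bf16(a, b):
    """Dot product built only from bfloat16 multiplications with FP32 accumulation.
    Each bfloat16 product (8-bit by 8-bit significands) is exact in float32."""
    ah, am, al = split3(a)
    bh, bm, bl = split3(b)
    acc = np.float32(0.0)
    # Partial products, most significant first; the smallest cross terms are dropped.
    for pa, pb in ((ah, bh), (ah, bm), (am, bh), (ah, bl), (al, bh), (am, bm)):
        acc = np.float32(acc + np.float32(np.dot(pa, pb)))
    return acc

rng = np.random.default_rng(1)
a = rng.standard_normal(256).astype(np.float32)
b = rng.standard_normal(256).astype(np.float32)
print(dot_via_bf16(a, b))                                               # near-FP32 accuracy
print(np.float32(np.dot(a.astype(np.float64), b.astype(np.float64))))   # FP64 reference
```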
“…Matrix calculations based on floating point numbers and bfloats have been investigated by Intel engineers using GEMM (General matrix multiply) algorithms as the basis for exploring prospects of low precision computations in comparison with standard data types using more than 16 bits [15]. The authors claim that systolic arrays based on bfloat16 and float16 provide many times faster computations than standard 32-bit floats.…”
Section: Bfloat (mentioning)
confidence: 99%