2019
DOI: 10.48550/arxiv.1905.12322
Preprint

A Study of BFLOAT16 for Deep Learning Training

Abstract: This paper presents the first comprehensive empirical study demonstrating the efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep Learning training across image classification, speech recognition, language modeling, generative networks and industrial recommendation systems. BFLOAT16 is attractive for Deep Learning training for two reasons: the range of values it can represent is the same as that of the IEEE 754 single-precision floating-point format (FP32), and conversion to/from FP32 is simple. Maintaini…
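As a concrete illustration of the two properties named in the abstract, the sketch below (plain Python, not taken from the paper; the helper names fp32_to_bf16_bits and bf16_bits_to_fp32 are hypothetical) converts FP32 to BFLOAT16 by keeping the upper 16 bits of the IEEE 754 single-precision encoding: the 8 exponent bits, and therefore the representable range, are preserved, while the mantissa shrinks from 23 to 7 bits.

import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BFLOAT16 encoding of x (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # FP32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)           # round to nearest even
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_fp32(b: int) -> float:
    """Expand a BFLOAT16 bit pattern back to FP32 by zero-filling the low mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

if __name__ == "__main__":
    for v in (3.14159, 1e-38, 3.0e38):    # 3.0e38 overflows FP16 but not BFLOAT16
        b = fp32_to_bf16_bits(v)
        print(f"{v:>12g} -> 0x{b:04X} -> {bf16_bits_to_fp32(b):g}")

Round-tripping a value near the FP32 maximum (e.g., 3.0e38) stays finite, which FP16 cannot do, at the cost of mantissa precision only.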

Cited by 45 publications (59 citation statements)
References 20 publications
“…We adopt BF16 as the data format. BF16 has the same accuracy as FP32 for NN training [68] but is more cost-efficient. We estimate CAE and NME's area and power using 16nm and 28nm technologies, respectively.…”
Section: Evaluation, A. Methodology (mentioning)
confidence: 99%
“…The reconfigurable core consists of three MAC modules and four multiplexers. Each MAC contains a BFloat16 multiplier and an FP32 adder [19], [20] to accommodate both training and inference. If only inference is desired, the hardware can use the 8-bit int8 type [2], [3].…”
Section: A Reconfigurable Core (mentioning)
confidence: 99%
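As a rough software illustration of the numeric behaviour of such a MAC (a BFloat16 multiplier feeding an FP32 adder), the NumPy sketch below quantizes the operands to BFLOAT16 by truncating the low 16 mantissa bits and accumulates the products in FP32. It is a sketch under these assumptions, not the cited hardware design, and the helper names to_bf16 and mac_bf16_fp32 are hypothetical.

import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Quantize FP32 values to BFLOAT16 by truncating the low 16 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)   # BF16-valued, stored as FP32

def mac_bf16_fp32(a: np.ndarray, b: np.ndarray) -> np.float32:
    """Dot product computed as BF16 x BF16 products accumulated by an FP32 adder."""
    a16, b16 = to_bf16(a), to_bf16(b)
    acc = np.float32(0.0)
    for x, y in zip(a16, b16):
        acc = np.float32(acc + x * y)    # product and accumulation stay in FP32
    return acc

rng = np.random.default_rng(0)
a = rng.standard_normal(256, dtype=np.float32)
b = rng.standard_normal(256, dtype=np.float32)
print("BF16 multiply, FP32 accumulate:", mac_bf16_fp32(a, b))
print("FP32 reference dot product    :", np.float32(a @ b))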
“…Micikevicius et al [91] proposed a general-purpose mixed precision training framework for training large-scale DNNs efficiently, almost halving the GPU memory usage. A mixed precision training framework adopting the BFLOAT16 format, which can represent the same range of values as FP32, was presented by Kalamkar et al [92] to avoid the loss scaling required in [91]. Recently, Yang et al [93] proposed a low-precision stochastic gradient descent (SGD) approach by taking advantage of stochastic weight averaging and quantizing the gradient accumulator as well as the velocity vector.…”
Section: Reduced-precision Training for Neural Network (mentioning)
confidence: 99%
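To make the loss-scaling distinction concrete, the following PyTorch-style sketch (assuming a model, optimizer and (x, y) batches exist; it is illustrative, not the framework of [91] or [92]) contrasts a BF16 training step, which needs no loss scaling because BFLOAT16 keeps FP32's exponent range, with an FP16 step that pairs autocast with a GradScaler to keep small gradients from underflowing.

import torch
import torch.nn.functional as F

def train_step_bf16(model, optimizer, x, y):
    """One training step with BF16 autocast; no loss scaling is needed."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = F.cross_entropy(model(x), y)   # eligible ops run in BF16
    loss.backward()                           # no GradScaler / loss scaling required
    optimizer.step()
    return loss.item()

def train_step_fp16(model, optimizer, scaler, x, y):
    """The FP16 counterpart: a GradScaler guards against gradient underflow."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()             # scale the loss before backprop
    scaler.step(optimizer)                    # unscale gradients, skip step on inf/nan
    scaler.update()
    return loss.item()

# FP16 variant only: scaler = torch.cuda.amp.GradScaler()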