2020
DOI: 10.48550/arxiv.2007.06775
Preprint
Analyzing and Mitigating Data Stalls in DNN Training

Abstract: Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time spanning different layers of the system stack, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time…

Cited by 6 publications (11 citation statements)
References 39 publications
“…An input bottleneck occurs when the input pipeline is not able to generate batches of training examples as fast as the training computation can consume them. If the time spent waiting for the input pipeline exceeds tens of microseconds on average, the input pipeline is not keeping up with model training, causing a data stall (Mohan et al., 2020). The current practice of pipeline tuning, which optimizes the throughput (rate) of the pipeline, is explained below.…”
Section: Understanding Input Bottlenecks
confidence: 99%
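To make the data-stall notion concrete, the wait time can be measured by timing how long each training step blocks on the input pipeline versus how long it spends computing. Below is a minimal sketch in Python; `loader` and `train_step` are hypothetical placeholders, not names from the paper:

```python
import time

def measure_data_stalls(loader, train_step, num_steps=100):
    """Estimate time blocked on the input pipeline vs. time computing.

    `loader` yields batches; `train_step` runs one forward/backward pass.
    Both are illustrative placeholders. With an asynchronous accelerator,
    a device synchronization would be needed for an accurate compute time.
    """
    fetch_time = 0.0    # time spent waiting for the next batch (data stall)
    compute_time = 0.0  # time spent in the training computation
    it = iter(loader)
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(it)       # blocks if the pipeline is falling behind
        t1 = time.perf_counter()
        train_step(batch)      # model forward/backward + parameter update
        t2 = time.perf_counter()
        fetch_time += t1 - t0
        compute_time += t2 - t1
    print(f"avg stall per step: {fetch_time / num_steps * 1e6:.1f} us, "
          f"stall fraction: {fetch_time / (fetch_time + compute_time):.1%}")
```

If the average stall per step stays well above the threshold the citing authors mention, the pipeline, not the model computation, is the bottleneck.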
“…Dataset Echoing (Choi et al., 2019) repeats input pipeline operations to match the rate of the input pipeline with the compute steps. DS-Analyzer predicts how much file cache memory is necessary to match the compute steps (Mohan et al., 2020). Progressive Compressed Records (Kuchnik et al., 2019) likewise match compression levels to minimize I/O.…”
Section: Related Work
confidence: 99%
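As a rough illustration of the first technique, dataset echoing can be sketched as a wrapper that yields each pipeline output several times, amortizing expensive upstream work (I/O, decoding, augmentation) over multiple training steps. This is a simplified sketch, not the implementation from Choi et al.; all names are hypothetical:

```python
def echo(pipeline, echo_factor=2):
    """Dataset-echoing sketch: repeat each upstream item `echo_factor`
    times so slow I/O and preprocessing are amortized over several
    training steps. Echoing earlier in the pipeline (before shuffling
    and augmentation) preserves more randomness in what the model sees.
    """
    for item in pipeline:
        for _ in range(echo_factor):
            yield item

# usage sketch (names illustrative):
# for batch in echo(preprocessed_batches, echo_factor=2):
#     train_step(batch)
```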
“…This pipeline overlaps with the training itself, aiming to improve resource utilization and hide the overhead of assembling the minibatches. However, despite this overlap, data loading and preprocessing often become a bottleneck [8,19,26,30,39], with reports of overheads of up to 72% of end-to-end training time [26,30]. With the ever-increasing accumulation of training data, data loading is likely to become yet more costly, prompting the need for scalable solutions to mitigate these overheads.…”
Section: Introduction
confidence: 99%
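For intuition, the overlap described above is typically achieved by preparing batches on background workers while the accelerator consumes earlier ones. The following is a minimal sketch using only Python's standard library; `make_batches` is an assumed, illustrative iterable factory:

```python
import queue
import threading

def prefetching_loader(make_batches, depth=4):
    """Run the input pipeline on a background thread and hand batches
    to the training loop through a bounded queue, overlapping data
    preparation with compute. `depth` bounds how far ahead we prefetch.
    """
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()  # marks the end of the stream

    def producer():
        for batch in make_batches():
            q.put(batch)         # blocks once `depth` batches are queued
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is SENTINEL:
            break
        yield batch
```

When the producer cannot keep the queue non-empty, the consumer blocks on `q.get()`, which is exactly the data stall the cited works measure.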
“…Storage capacity and bandwidth requirements (bytes moved from storage to compute device per image) also scale quadratically with image resolution, affecting the monetary cost (both storage and network usage are billed) of inference in real-world datacenter or cloud deployments, where a separate storage cluster is usually used to store and forward input data through the network [21]. As a result, DNN training is frequently dominated by data stall time, which happens both remotely and locally, and can be due to CPU decoding overhead [20].…”
Section: Introduction
confidence: 99%
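The quadratic scaling is easy to check: an uncompressed H×W RGB image occupies H·W·3 bytes, so doubling the resolution quadruples the bytes moved per image. A quick back-of-the-envelope calculation with illustrative resolutions:

```python
def raw_image_bytes(resolution, channels=3, bytes_per_value=1):
    """Bytes per uncompressed square image: resolution^2 * channels."""
    return resolution ** 2 * channels * bytes_per_value

for r in (224, 448, 896):
    print(f"{r}x{r}: {raw_image_bytes(r) / 1e6:.2f} MB per image")
# prints 0.15 MB, 0.60 MB, 2.41 MB -- ~4x growth per resolution doubling
```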