Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2021
DOI: 10.1145/3458817.3476181

Clairvoyant prefetching for distributed machine learning I/O

Cited by 32 publications (22 citation statements)
References 50 publications
“…Other approaches such as FanStore [26] provide a global cache layer on node-local burst buffers in a compressed format, allowing POSIX-compliant file access to the compressed data in user space. Further optimizations explore prefetching with perfect knowledge of future I/O based on fixing the seeds of pseudo-random number generators [5]. Such approaches are limited in their applicability to accelerating low-level I/O operations only.…”
Section: Related Work
Confidence: 99%
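The seed-fixing idea referenced above can be illustrated with a short sketch. This is a minimal illustration under assumed conventions (a NumPy RNG and a hypothetical per-epoch seed offset), not the mechanism from [5]: once the shuffle seed is fixed, any future epoch's access order can be recomputed deterministically, so the corresponding samples can be staged into a node-local buffer before they are requested.

```python
# Minimal sketch (assumed conventions, not the implementation from [5]):
# with a fixed shuffle seed, the access order of any future epoch can be
# recomputed ahead of time, enabling prefetching of the samples it will read.
import numpy as np

def access_order(num_samples: int, epoch: int, seed: int = 42) -> np.ndarray:
    """Reproduce the exact sample order of a given epoch from the fixed seed."""
    rng = np.random.default_rng(seed + epoch)  # hypothetical per-epoch seeding scheme
    order = np.arange(num_samples)
    rng.shuffle(order)
    return order

# A prefetcher can look arbitrarily far ahead, e.g. stage the first samples
# of the *next* epoch into a node-local buffer while the current epoch runs.
upcoming = access_order(num_samples=1_000_000, epoch=3)[:4096]
```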
“…When applying data parallelism on HPC clusters, the training typically includes three stages: (1) I/O: loading the data from a remote parallel file system (i.e., GPFS, Lustre) to host memory; (2) Computation: performing forward and backward phases to calculate the local gradient on each device; (3) Communication: synchronizing averaged gradients across multiple devices to update model weights. Among these stages, I/O is a bottleneck for distributed training with large-scale datasets [12,35,51]. Thus, SOLAR is designed primarily to optimize the data loading stage in training DNN-based surrogates.…”
Section: Distributed Training With Data Parallelism
Confidence: 99%
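The three stages can be made concrete with a toy, single-process sketch of one training step; the gradient computation and the "all-reduce" here are plain-NumPy stand-ins for illustration, not any framework's API.

```python
# Toy sketch of one data-parallel step: (1) I/O, (2) computation, (3) communication.
import numpy as np

def load_batch(indices, dataset):
    # (1) I/O: in a real run this reads samples from a parallel file system into host memory.
    return dataset[indices]

def forward_backward(weights, batch):
    # (2) Computation: toy linear-regression gradient on the local mini-batch.
    x, y = batch[:, :-1], batch[:, -1]
    return x.T @ (x @ weights - y) / len(y)

def allreduce(grads_per_rank):
    # (3) Communication: average the local gradients across "devices" (here, a list).
    return np.mean(grads_per_rank, axis=0)

rng = np.random.default_rng(0)
dataset = rng.standard_normal((1024, 5))
weights = np.zeros(4)
local_grads = [forward_backward(weights, load_batch(rng.choice(1024, 64), dataset))
               for _ in range(4)]  # four simulated ranks
weights -= 0.1 * allreduce(local_grads)  # synchronized weight update
```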
“…Existing Works: A few state-of-the-art data loaders have been proposed to tackle some of the above issues caused by the buffering scheme. (1) NoPFS [12] utilizes a heuristic performance model to predict the data to be used by the next epoch, and accordingly determines the data eviction scheme to increase the buffer hit rate. (2) DeepIO [51] limits the data shuffling within the buffer of each compute node to eliminate the buffer misses and achieve maximum data reuse.…”
Section: Introduction
Confidence: 99%
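To see why knowing the future access stream helps the eviction decision, the sketch below applies a Belady-style policy: when the buffer is full, evict the cached sample whose next use is farthest away (or never comes). This is only an illustration of the principle behind clairvoyant buffering, not the performance model or eviction scheme NoPFS [12] actually implements.

```python
# Toy buffer with Belady-style ("clairvoyant") eviction under a known access order.
def run_buffer(accesses, capacity):
    buffer, hits = set(), 0
    for t, sample in enumerate(accesses):
        if sample in buffer:
            hits += 1
            continue
        if len(buffer) >= capacity:
            future = accesses[t + 1:]
            # Evict the sample whose next use is farthest away (never reused evicts first).
            def next_use(s):
                return future.index(s) if s in future else float("inf")
            buffer.remove(max(buffer, key=next_use))
        buffer.add(sample)
    return hits / len(accesses)

# Example: a small access trace and a two-slot buffer.
print(run_buffer([0, 1, 2, 0, 3, 0, 1, 2, 3], capacity=2))  # hit rate 1/3
```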
“…With innovative progress in computing technology, GPU vendors are making individual GPUs bigger and faster -where an individual GPU can now deliver more than 300 TeraFLOPS of performance and is on the path to becoming a supercomputer of the past by itself [5,6]. This trend has served the AI/ML models well since the computing requirements of these models are increasing at a rapid pace [7][8][9]. Unfortunately, as our experimental characterization (Sec.…”
Section: Introduction
Confidence: 99%