Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2021
DOI: 10.1145/3458817.3476181

Clairvoyant prefetching for distributed machine learning I/O

Cited by 32 publications (22 citation statements)
References 50 publications
“…Other approaches such as FanStore [26] provide a global cache layer on node-local burst buffers in a compressed format, allowing POSIX-compliant file access to the compressed data in user space. Further optimizations explore prefetching with perfect knowledge of future I/O based on fixing the seeds of pseudo-random number generators [5]. Such approaches are limited in their applicability to accelerating low-level I/O operations only.…”
Section: Related Work
Confidence: 99%
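The seed-fixing idea referenced above can be illustrated with a short sketch. This is a minimal illustration under assumed conventions (a NumPy RNG and a hypothetical per-epoch seed offset), not the mechanism from [5]: once the shuffle seed is fixed, any future epoch's access order can be recomputed deterministically, so the corresponding samples can be staged into a node-local buffer before they are requested.

```python
# Minimal sketch (assumed conventions, not the implementation from [5]):
# with a fixed shuffle seed, the access order of any future epoch can be
# recomputed ahead of time, enabling prefetching of the samples it will read.
import numpy as np

def access_order(num_samples: int, epoch: int, seed: int = 42) -> np.ndarray:
    """Reproduce the exact sample order of a given epoch from the fixed seed."""
    rng = np.random.default_rng(seed + epoch)  # hypothetical per-epoch seeding scheme
    order = np.arange(num_samples)
    rng.shuffle(order)
    return order

# A prefetcher can look arbitrarily far ahead, e.g. stage the first samples
# of the *next* epoch into a node-local buffer while the current epoch runs.
upcoming = access_order(num_samples=1_000_000, epoch=3)[:4096]
```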
“…When applying data parallelism on HPC clusters, the training typically includes three stages: (1) I/O: loading the data from a remote parallel file system (i.e., GPFS, Lustre) to host memory; (2) Computation: performing forward and backward phases to calculate the local gradient on each device; (3) Communication: synchronizing averaged gradients across multiple devices to update model weights. Among these stages, I/O is a bottleneck for distributed training with large-scale datasets [12,35,51]. Thus, SOLAR is designed primarily to optimize the data loading stage in training DNN-based surrogates.…”
Section: Distributed Training With Data Parallelism
Confidence: 99%
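The three stages can be made concrete with a toy, single-process sketch of one training step; the gradient computation and the "all-reduce" here are plain-NumPy stand-ins for illustration, not any framework's API.

```python
# Toy sketch of one data-parallel step: (1) I/O, (2) computation, (3) communication.
import numpy as np

def load_batch(indices, dataset):
    # (1) I/O: in a real run this reads samples from a parallel file system into host memory.
    return dataset[indices]

def forward_backward(weights, batch):
    # (2) Computation: toy linear-regression gradient on the local mini-batch.
    x, y = batch[:, :-1], batch[:, -1]
    return x.T @ (x @ weights - y) / len(y)

def allreduce(grads_per_rank):
    # (3) Communication: average the local gradients across "devices" (here, a list).
    return np.mean(grads_per_rank, axis=0)

rng = np.random.default_rng(0)
dataset = rng.standard_normal((1024, 5))
weights = np.zeros(4)
local_grads = [forward_backward(weights, load_batch(rng.choice(1024, 64), dataset))
               for _ in range(4)]  # four simulated ranks
weights -= 0.1 * allreduce(local_grads)  # synchronized weight update
```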
“…Existing Works: A few state-of-the-art data loaders have been proposed to tackle some of the above issues caused by the buffering scheme. (1) NoPFS [12] utilizes a heuristic performance model to predict the data to be used by the next epoch, and accordingly determines the data eviction scheme to increase the buffer hit rate. (2) DeepIO [51] limits the data shuffling within the buffer of each compute node to eliminate the buffer misses and achieve maximum data reuse.…”
Section: Introduction
Confidence: 99%
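To see why knowing the future access stream helps the eviction decision, the sketch below applies a Belady-style policy: when the buffer is full, evict the cached sample whose next use is farthest away (or never comes). This is only an illustration of the principle behind clairvoyant buffering, not the performance model or eviction scheme NoPFS [12] actually implements.

```python
# Toy buffer with Belady-style ("clairvoyant") eviction under a known access order.
def run_buffer(accesses, capacity):
    buffer, hits = set(), 0
    for t, sample in enumerate(accesses):
        if sample in buffer:
            hits += 1
            continue
        if len(buffer) >= capacity:
            future = accesses[t + 1:]
            # Evict the sample whose next use is farthest away (never reused evicts first).
            def next_use(s):
                return future.index(s) if s in future else float("inf")
            buffer.remove(max(buffer, key=next_use))
        buffer.add(sample)
    return hits / len(accesses)

# Example: a small access trace and a two-slot buffer.
print(run_buffer([0, 1, 2, 0, 3, 0, 1, 2, 3], capacity=2))  # hit rate 1/3
```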
“…With innovative progress in computing technology, GPU vendors are making individual GPUs bigger and faster -where an individual GPU can now deliver more than 300 TeraFLOPS of performance and is on the path to becoming a supercomputer of the past by itself [5,6]. This trend has served the AI/ML models well since the computing requirements of these models are increasing at a rapid pace [7][8][9]. Unfortunately, as our experimental characterization (Sec.…”
Section: Introduction
Confidence: 99%