2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC)
DOI: 10.1109/hipc.2019.00046
MLBS: Transparent Data Caching in Hierarchical Storage for Out-of-Core HPC Applications

Abstract: Out-of-core simulation systems produce and/or consume a massive amount of data that cannot fit on a single compute node memory and that usually needs to be read and/or written back and forth during computation. I/O data movement may thus represent a bottleneck in large-scale simulations. To increase I/O bandwidth, high-end supercomputers are equipped with hierarchical storage subsystems such as node-local and remote-shared NVMe and SSD-based Burst Buffers. Advanced caching systems have recently been developed …

Cited by 8 publications (6 citation statements) · References 25 publications (27 reference statements)
“…2) On-the-fly Individual Checkpoint Compression: In this approach, the checkpoints are compressed one at a time at their source, i.e., the GPU HBM in our case, by the checkpointing runtime. This approach is widely used for accelerating data transfer for out-of-core stencil computations [22], reverse-mode adjoint computations [23], and reducing data-stream intensity from scientific equipment, e.g., the Advanced Photon Source [24]. Therefore, we consider this approach representative of state-of-the-art GPU-compression-enabled data movement techniques.…”
Section: B. Compared Approaches
confidence: 99%
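The on-the-fly compression pattern described in the statement above can be sketched in a few lines. This is a hypothetical illustration, not the cited papers' implementation: the function names are assumptions, and `zlib` stands in for whatever compressor a real runtime would use. The idea is simply to compress each checkpoint at its source, before it crosses the storage hierarchy.

```python
import zlib


def save_checkpoint(state: bytes, path: str, level: int = 1) -> int:
    """Compress a checkpoint at its source before it leaves the fast
    tier, trading CPU cycles for reduced transfer volume.
    Returns the number of bytes actually written."""
    compressed = zlib.compress(state, level)
    with open(path, "wb") as f:
        f.write(compressed)
    return len(compressed)


def load_checkpoint(path: str) -> bytes:
    """Read a compressed checkpoint back and restore the raw state."""
    with open(path, "rb") as f:
        return zlib.decompress(f.read())
```

A low compression level (here `level=1`) is the usual choice in this setting, since the compression must keep up with the data-movement rate rather than maximize the ratio.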
“…System-level checkpointing libraries such as NVCR [24] and CheCUDA [23] transparently record and replay the memory-based CUDA APIs for checkpointing and restoring. Approaches such as CheckFreq [30], GPUsnapshot [31], and Multi-layered Buffered System [32] also exploit heterogeneous storage tiers. However, none of these approaches consider short-running jobs, for which the impact of initialization overheads is non-negligible.…”
Section: Related Work
confidence: 99%
“…One such popular scenario is the use of checkpointing for the purpose of revisiting previous states in order to advance a computation. For example, the adjoint state method is an efficient numerical method for computing gradients that is widely employed by automatic differentiation (AD) tools [2], [3] and used in a variety of scientific applications: climate and ocean modeling [4], multi-physics [5], seismic imaging in the oil industry [6], etc. Deep learning (DL) techniques are also based on AD and often paired with stochastic gradient descent [7].…”
Section: Introduction
confidence: 99%
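The checkpoint-and-recompute pattern behind the adjoint state method mentioned above can be illustrated with a minimal sketch (all names here are hypothetical, not from any cited tool): the forward sweep stores state every few steps, and the reverse sweep recomputes each intermediate state from its nearest checkpoint before applying the adjoint step backwards.

```python
def forward_with_checkpoints(step, x0, n, stride):
    """Run n forward steps from x0, saving state every `stride` steps.
    Returns the final state and the checkpoint dictionary."""
    ckpts = {0: x0}
    x = x0
    for i in range(n):
        x = step(x)
        if (i + 1) % stride == 0:
            ckpts[i + 1] = x
    return x, ckpts


def reverse_sweep(step, back, ckpts, n, stride, adj):
    """Reverse pass: for each step i (from last to first), recompute
    the state at step i from the nearest earlier checkpoint, then
    apply the adjoint step `back` to propagate the gradient."""
    for i in range(n - 1, -1, -1):
        base = (i // stride) * stride
        x = ckpts[base]
        for _ in range(i - base):  # recompute the segment forward
            x = step(x)
        adj = back(x, adj)
    return adj
```

Storing only every `stride`-th state caps the memory footprint at the cost of recomputation, which is exactly the trade-off that motivates spilling checkpoints into deeper storage tiers.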
“…System-level checkpoint-restart libraries, e.g., CheCUDA [16] and NVCR [15], transparently record and replay all memory-based API calls. Efforts such as MLBS [21] and CheckFreq [22] leverage a multi-level memory subsystem, starting from GPU memory, to minimize checkpoint time. However, none of these approaches analyze the imbalance of checkpoint sizes across multiple devices, nor do they leverage peer-to-peer transfers to reduce checkpointing overheads.…”
Section: Related Work
confidence: 99%