2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC)
DOI: 10.1109/hipc.2019.00046
MLBS: Transparent Data Caching in Hierarchical Storage for Out-of-Core HPC Applications

Abstract: Out-of-core simulation systems produce and/or consume a massive amount of data that cannot fit on a single compute node memory and that usually needs to be read and/or written back and forth during computation. I/O data movement may thus represent a bottleneck in large-scale simulations. To increase I/O bandwidth, high-end supercomputers are equipped with hierarchical storage subsystems such as node-local and remote-shared NVMe and SSD-based Burst Buffers. Advanced caching systems have recently been developed …

Cited by 8 publications (6 citation statements) · References 25 publications (27 reference statements)
“…2) On-the-fly Individual Checkpoint Compression: In this approach, the checkpoints are compressed one at a time at their source, i.e., the GPU HBM in our case, by the checkpointing runtime. This approach is widely used for accelerating data transfer for out-of-core stencil computations [22], reverse-mode adjoint computations [23], and reducing data-stream intensity from scientific equipment, e.g., the Advanced Photon Source [24]. Therefore, we consider this approach representative of state-of-the-art GPU-compression-enabled data movement techniques.…”
Section: B. Compared Approaches
confidence: 99%
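The on-the-fly compression pattern described in the statement above can be sketched in a few lines. This is a hypothetical illustration, not the cited papers' implementation: the function names are assumptions, and `zlib` stands in for whatever compressor a real runtime would use. The idea is simply to compress each checkpoint at its source, before it crosses the storage hierarchy.

```python
import zlib


def save_checkpoint(state: bytes, path: str, level: int = 1) -> int:
    """Compress a checkpoint at its source before it leaves the fast
    tier, trading CPU cycles for reduced transfer volume.
    Returns the number of bytes actually written."""
    compressed = zlib.compress(state, level)
    with open(path, "wb") as f:
        f.write(compressed)
    return len(compressed)


def load_checkpoint(path: str) -> bytes:
    """Read a compressed checkpoint back and restore the raw state."""
    with open(path, "rb") as f:
        return zlib.decompress(f.read())
```

A low compression level (here `level=1`) is the usual choice in this setting, since the compression must keep up with the data-movement rate rather than maximize the ratio.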
“…System-level checkpointing libraries such as NVCR [24] and CheCUDA [23] transparently record and replay the memory-based CUDA APIs for checkpointing and restoring. Approaches such as CheckFreq [30], GPUsnapshot [31], and Multi-layered Buffered System [32] also exploit heterogeneous storage tiers. However, none of these approaches consider short-running jobs, for which the impact of initialization overheads is non-negligible.…”
Section: Related Work
confidence: 99%
“…One such popular scenario is the use of checkpointing for the purpose of revisiting previous states in order to advance a computation. For example, the adjoint state method is an efficient numerical method for computing gradients that is widely employed by automatic differentiation (AD) tools [2], [3] and used in a variety of scientific applications: climate and ocean modeling [4], multi-physics [5], seismic imaging in the oil industry [6], etc. Deep learning (DL) techniques are also based on AD and often paired with stochastic gradient descent [7].…”
Section: Introduction
confidence: 99%
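The checkpoint-and-recompute pattern behind the adjoint state method mentioned above can be illustrated with a minimal sketch (all names here are hypothetical, not from any cited tool): the forward sweep stores state every few steps, and the reverse sweep recomputes each intermediate state from its nearest checkpoint before applying the adjoint step backwards.

```python
def forward_with_checkpoints(step, x0, n, stride):
    """Run n forward steps from x0, saving state every `stride` steps.
    Returns the final state and the checkpoint dictionary."""
    ckpts = {0: x0}
    x = x0
    for i in range(n):
        x = step(x)
        if (i + 1) % stride == 0:
            ckpts[i + 1] = x
    return x, ckpts


def reverse_sweep(step, back, ckpts, n, stride, adj):
    """Reverse pass: for each step i (from last to first), recompute
    the state at step i from the nearest earlier checkpoint, then
    apply the adjoint step `back` to propagate the gradient."""
    for i in range(n - 1, -1, -1):
        base = (i // stride) * stride
        x = ckpts[base]
        for _ in range(i - base):  # recompute the segment forward
            x = step(x)
        adj = back(x, adj)
    return adj
```

Storing only every `stride`-th state caps the memory footprint at the cost of recomputation, which is exactly the trade-off that motivates spilling checkpoints into deeper storage tiers.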
“…System-level checkpoint-restart libraries, e.g., CheCUDA [16] and NVCR [15], transparently record and replay all memory-based API calls. Efforts such as MLBS [21] and CheckFreq [22] leverage a multi-level memory subsystem, starting from GPU memory, to minimize checkpoint time. However, none of these approaches analyze the imbalance of checkpoint sizes across multiple devices, nor do they leverage peer-to-peer transfers to reduce checkpointing overheads.…”
Section: Related Work
confidence: 99%