A study on data deduplication in HPC storage systems

Meister, Dirk; Kaiser, Jürgen; Brinkmann, André; Cortes, Toni; Kuhn, Michael; Kunkel, Julian

doi:10.1109/sc.2012.14

Cited by 85 publications

(43 citation statements)

References 26 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The zero chunk contributes significantly to the deduplication potential in enterprise backups and virtual machine images [55], [56]. In their HPC study, Meister et al found that between 3.1% and 24.3% of their HPC data consist of zero chunks [12]. In our case, the zero chunk is the most used chunk and is the main source of redundant data for every application and chunk size, except CDC with an average chunk size of 32 KB.…”

Section: A General Deduplicationmentioning

confidence: 57%

“…However, we vary the number of used processes in Section V-C. Table I shows the different sizes of the checkpoints. c) Deduplication: We analyzed each checkpoint with the FS-C deduplication tool suite [49], which has already been applied in several deduplication studies [50], [51]. We chose fixed-sized chunking and content-defined chunking (CDC) as chunking methods.…”

Section: Deduplication Of Checkpointsmentioning

confidence: 99%

“…Besides backup, primary storage also provides deduplication potential. Meister et al showed a huge potential for data deduplication in HPC storage systems that is not facilitated by today's HPC file systems [12]. Nicolae and Kulkarni et al applied the deduplication approach to the checkpointing use case and discussed systems to reduce the I/O load during checkpointing [13], [14].…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Deduplication Potential of HPC Applications’ Checkpoints

Kaiser

Gad

SuB

et al. 2016

2016 IEEE International Conference on Cluster Computing (CLUSTER)

Self Cite

View full text Add to dashboard Cite

Section: A General Deduplicationmentioning

confidence: 57%

Section: Deduplication Of Checkpointsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Deduplication Potential of HPC Applications’ Checkpoints

Kaiser

Gad

SuB

et al. 2016

2016 IEEE International Conference on Cluster Computing (CLUSTER)

Self Cite

View full text Add to dashboard Cite

“…Instead of storing duplicate data, a reference to the original block is created for each repeated occurrence. Our previously conducted study for HPC data already showed great potential for data savings, allowing 20-30 % of redundant data to be eliminated on average [21]. To determine the potential savings, we independently scanned 12 sets of directories with a total amount of data of more than 1 PB.…”

Section: Deduplicationmentioning

confidence: 99%

Exascale Storage Systems - An Analytical Study of Expenses

Kunkel

Kuhn

Ludwig

2014

JSFI

View full text Add to dashboard Cite

1The computational power and storage capability of supercomputers are growing at a different pace, with storage lagging behind; the widening gap necessitates new approaches to keep the investment and running costs for storage systems at bay. In this paper, we aim to unify previous models and compare different approaches for solving these problems. By extrapolating the characteristics of the German Climate Computing Center's previous supercomputers to the future, cost factors are identified and quantified in order to foster adequate research and development. Using models to estimate the execution costs of two prototypical use cases, we are discussing the potential of three concepts: re-computation, data deduplication and data compression.

show abstract

“…attempt reduce the checkpoint sizes. While there are several techniques proposed in this direction, recent studies [6] point out that deduplication (i.e. identifying identical copies of data and storing only one copy) shows promising potential, with reported reductions of up to 70%.…”

Section: Introductionmentioning

confidence: 99%

Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal

Nicolae

2013

2013 IEEE 27th International Symposium on Parallel and Distributed Processing

View full text Add to dashboard Cite

Abstract-With increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence. For a large class of applications that run for a long time and are tightly coupled, Checkpoint-Restart (CR) is the only feasible method to survive failures. However, exploding checkpoint sizes that need to be dumped to storage pose a major scalability challenge, prompting the need to reduce the amount of checkpointing data. This paper contributes with a novel collective memory contents deduplication scheme that attempts to identify and eliminate duplicate memory pages before they are saved to storage. Unlike previous approaches that concentrate on the checkpoints of the same process, our approach identifies duplicate memory pages shared by different processes (regardless whether on the same or different node). We show both how to achieve such a global deduplication in a scalable fashion and how to leverage it effectively to optimize the data layout in such way that it minimizes I/O bottlenecks. Large scale experiments show significant reduction of storage space consumption and performance overhead compared to several state-of-art approaches, both in synthetic benchmarks and for a real life high performance computing application.

show abstract

A study on data deduplication in HPC storage systems

Cited by 85 publications

References 26 publications

Deduplication Potential of HPC Applications’ Checkpoints

Deduplication Potential of HPC Applications’ Checkpoints

Exascale Storage Systems - An Analytical Study of Expenses

Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal

Contact Info

Product

Resources

About