Design of an exact data deduplication cluster

Kaiser, Jürgen; Meister, Dirk; Brinkmann, André; Effert, Sascha

doi:10.1109/msst.2012.6232380

Cited by 26 publications

(15 citation statements)

References 18 publications


“…Other distributed systems assume nodes with individual CPU and RAM that have access to a shared storage device abstraction where nodes perform deduplication in parallel. This allows the sharing of metadata information between nodes by keeping it on the shared storage device, which otherwise would have to be sent over the network [Clements et al 2009;Kaiser et al 2012]. Finally, distinct nodes may handle distinct tasks.…”

Section: Scope (mentioning)

confidence: 99%

“…Although specific details are not presented, several gateways can be combined to perform deduplication over a common data repository, thus allowing global distributed deduplication. As a distinct approach, the dedupv1 centralized design can be extended over a shared storage device (SAN) where several nodes have exclusive access to their own data partitions [Kaiser et al 2012]. Nodes are seen as independent dedupv1 nodes that export their own iSCSI interface, partition data, compute hashes, and map chunk requests to the correct nodes.…”

Section: Backup and Archival Storage (mentioning)

confidence: 99%


A Survey and Classification of Storage Deduplication Systems

2014


The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique to reduce storage costs. Thus, it has been applied to different storage types, including archives and backups, primary storage, within solid-state drives, and even to random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, thus underestimating the relevance of new research and development. The first contribution of this article is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. This classification identifies and describes the different approaches used for each of them. As a second contribution, we describe which combinations of these design decisions have been proposed and found more useful for challenges in each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed.


“…On the other hand, a decentralized approach to distributing deduplication metadata management across multiple servers [10,7,8,5,21,9,15,12] require additional hardware and software resource cost for multiple deduplication servers. In order to reduce such additional cost, simple DB-sharding approach that embeds the DB-shard of the whole dedup metadata database on each storage server has been proposed [13]. However, this DB-sharding approach to SN-SS suffers from inherited problems, i.e., to identify a duplicate chunk, the fingerprint lookup must be broadcasted to all DB-shards in the cluster.…”

Section: Introduction (mentioning)

confidence: 99%

A Robust Fault-Tolerant and Scalable Cluster-Wide Deduplication for Shared-Nothing Storage Systems

Khan, Lee, Hamandawana, et al.

2018

2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)


Deduplication has been largely employed in distributed storage systems to improve space efficiency. Traditional deduplication research ignores the design specifications of shared-nothing distributed storage systems such as no central metadata bottleneck, scalability, and storage rebalancing. Further, deduplication introduces transactional changes, which are prone to errors in the event of a system failure, resulting in inconsistencies in data and deduplication metadata. In this paper, we propose a robust, fault-tolerant and scalable cluster-wide deduplication that can eliminate duplicate copies across the cluster. We design a distributed deduplication metadata shard which guarantees performance scalability while preserving the design constraints of shared-nothing storage systems. The placement of chunks and deduplication metadata is made cluster-wide based on the content fingerprint of chunks. To ensure transactional consistency and garbage identification, we employ a flag-based asynchronous consistency mechanism. We implement the proposed deduplication on Ceph. The evaluation shows high disk-space savings with minimal performance degradation as well as high robustness in the event of sudden server failure. * Mr. Prince is currently affiliated with Ajou University, Suwon, Republic of Korea.
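The contrast between a broadcast fingerprint lookup and placement based on the content fingerprint can be illustrated with a minimal sketch. It is not the implementation from the paper above: the `Cluster` class, the `shard_for` helper, the choice of SHA-256, and the modulo placement rule are all hypothetical and chosen only for illustration.

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    """Content fingerprint of a chunk (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).digest()


def shard_for(fp: bytes, num_shards: int) -> int:
    """Hypothetical placement rule: derive the shard from the fingerprint."""
    return int.from_bytes(fp[:8], "big") % num_shards


class Cluster:
    """Toy cluster whose metadata shards are plain in-memory sets."""

    def __init__(self, num_shards: int):
        self.shards = [set() for _ in range(num_shards)]

    def store(self, chunk: bytes) -> bool:
        """Store a chunk; return True if it was new, False if a duplicate.

        Only the one shard selected by the fingerprint is consulted.
        """
        fp = fingerprint(chunk)
        shard = self.shards[shard_for(fp, len(self.shards))]
        if fp in shard:
            return False
        shard.add(fp)
        return True

    def broadcast_lookup(self, fp: bytes) -> int:
        """Count the shards probed when the owning shard is unknown.

        Every shard has to be asked, which is the broadcast cost.
        """
        probes = 0
        for shard in self.shards:
            probes += 1
            _ = fp in shard
        return probes


cluster = Cluster(num_shards=4)
first = cluster.store(b"hello")    # new chunk
second = cluster.store(b"hello")   # duplicate, detected on a single shard
probes = cluster.broadcast_lookup(fingerprint(b"hello"))
```

In this sketch a routed lookup touches one shard regardless of cluster size, while the broadcast variant probes every shard.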


“…A major class of data deduplication systems is called fingerprinting-based data deduplication [25,37,17,3,21,5,6,12,34]. The generic design for backup-oriented deduplication systems splits the data stream into chunks.…”

Section: Introduction (mentioning)

confidence: 99%

Block locality caching for data deduplication

Meister, Kaiser, Brinkmann

2013

Proceedings of the 6th International Systems and Storage Conference on - SYSTOR '13

Self Cite


Data deduplication systems discover and remove redundancies between data blocks by splitting the data stream into chunks and comparing a hash of each chunk with all previously stored hashes. Storing the corresponding chunk index on hard disks immediately limits the achievable throughput, as these devices are unable to support the high number of random IOs induced by this index. Several approaches to overcome this chunk lookup disk bottleneck have been proposed. Often, the approaches try to capture the locality information of a backup run and use this in the next backup run to predict future chunk requests. However, often this locality is only captured by a surrogate, e.g., the order of the chunks in containers [37]. Furthermore, some approaches degenerate slowly when the systems operate over months and years because the locality information becomes outdated. We propose a novel approach, called Block Locality Cache (BLC), that captures the previous backup run significantly better than existing approaches and also always uses up-to-date locality information and which is, therefore, less prone to aging. We evaluate the approach using a trace-based simulation of multiple real-world backup datasets. The simulation compares the Block Locality Cache with the approach of Zhu et al. [37] and provides a detailed analysis of the behavior and IO pattern. Furthermore, a prototype implementation is used to validate the simulation.
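The generic fingerprinting step mentioned at the start of this abstract (split the stream into chunks, hash each chunk, compare against stored hashes) can be sketched as follows. This is a simplified illustration, not the Block Locality Cache or any system cited here: the fixed `CHUNK_SIZE`, the use of SHA-1, and the in-memory dictionary standing in for the on-disk chunk index are all assumptions.

```python
import hashlib

# Hypothetical fixed chunk size; real backup systems often use
# content-defined chunking with variable-sized chunks instead.
CHUNK_SIZE = 4096


def deduplicate(stream: bytes, index: dict) -> list:
    """Split the stream into chunks and store only unseen chunks.

    `index` maps fingerprint -> chunk data and stands in for the chunk
    index. The returned recipe is the ordered list of fingerprints
    needed to rebuild the stream.
    """
    recipe = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).digest()
        if fp not in index:       # chunk lookup against stored hashes
            index[fp] = chunk     # new chunk: store it once
        recipe.append(fp)
    return recipe


def restore(recipe: list, index: dict) -> bytes:
    """Rebuild the original stream from its recipe."""
    return b"".join(index[fp] for fp in recipe)


index = {}
data = b"A" * (3 * CHUNK_SIZE) + b"B" * CHUNK_SIZE
recipe = deduplicate(data, index)
```

Here the four chunks of `data` collapse to two stored chunks. Every `fp not in index` test corresponds to one chunk index lookup, which is the operation that becomes a random-IO bottleneck once the index lives on disk.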

