Dirk Meister scite author profile

Kaiser

et al. 2012

Deduplication is a storage saving technique that is highly successful in enterprise backup environments. On a file system, a single data block might be stored multiple times across different files, for example, multiple versions of a file might exist that are mostly identical. With deduplication, this data replication is localized and redundancy is removed: by storing data just once, all files that use identical regions refer to the same unique data. The most common approach splits file data into chunks and calculates a cryptographic fingerprint for each chunk. By checking if the fingerprint has already been stored, a chunk is classified as redundant or unique. Only unique chunks are stored. This paper presents the first study on the potential of data deduplication in HPC centers, which belong to the most demanding storage producers. We have quantitatively assessed this potential for capacity reduction for 4 data centers (BSC, DKRZ, RENCI, RWTH). In contrast to previous deduplication studies focusing mostly on backup data, we have analyzed over one PB (1212 TB) of online file system data. The evaluation shows that typically 20% to 30% of this online data can be removed by applying data deduplication techniques, peaking up to 70% for some data sets. This reduction can only be achieved by a subfile deduplication approach, while approaches based on whole-file comparisons only lead to small capacity savings.
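The chunk-and-fingerprint approach described in this abstract can be illustrated with a minimal sketch. It is not the tooling used in the study: the fixed 4 KiB chunk size, the choice of SHA-256 as the fingerprint, and the function names `deduplicate`, `restore` and `savings` are all assumptions made for illustration.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each unique chunk once.

    chunk_size and SHA-256 are illustrative assumptions, not values
    taken from the paper.
    """
    store = {}   # fingerprint -> chunk bytes (unique chunks only)
    recipe = []  # ordered fingerprints needed to rebuild the data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        if fingerprint not in store:
            # Fingerprint not seen before: the chunk is unique, store it.
            store[fingerprint] = chunk
        # Redundant or not, the recipe records which chunk belongs here.
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fp] for fp in recipe)


def savings(data: bytes, store) -> float:
    """Fraction of the input that did not need to be stored."""
    stored = sum(len(chunk) for chunk in store.values())
    return 1 - stored / len(data)
```

Calling `deduplicate` on a buffer that contains repeated 4 KiB regions returns a store with fewer chunks than the recipe has entries, and `restore` reproduces the input from the two.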

dedupv1: Improving deduplication throughput using solid state drives (SSD)

2010

Data deduplication systems discover and remove redundancies between data blocks. The search for redundant data blocks is often based on hashing the content of a block and comparing the resulting hash value with already stored entries inside an index. The limited random IO performance of hard disks limits the overall throughput of such systems if the index does not fit into main memory. This paper presents the architecture of the dedupv1 deduplication system that uses solid-state drives (SSDs) to improve its throughput compared to disk-based systems. dedupv1 is designed to use the sweet spots of SSD technology (random reads and sequential operations), while avoiding random writes inside the data path. This is achieved by using a hybrid deduplication design. It is an inline deduplication system as it performs chunking and fingerprinting online and only stores new data, but it is able to delay much of the processing as well as IO operations.

Multi-level comparison of data deduplication in a backup scenario

2009

Data deduplication systems detect redundancies between data blocks to either reduce storage needs or to reduce network traffic. A class of deduplication systems splits the data stream into data blocks (chunks) and then finds exact duplicates of these blocks. This paper compares the influence of different chunking approaches on multiple levels. On a macroscopic level, we compare the chunking approaches based on real-life user data in a weekly full backup scenario, both at a single point in time as well as over several weeks. In addition, we analyze how small changes affect the deduplication ratio for different file types on a microscopic level for chunking approaches and delta encoding. An intuitive assumption is that small semantic changes on documents cause only small modifications in the binary representation of files, which would imply a high ratio of deduplication. We will show that this assumption is not valid for many important file types and that application-specific chunking can help to further decrease storage capacity demands.
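Why the choice of chunking approach matters for small changes can be shown with a toy comparison. The abstract does not name the approaches it compares, so the sketch below assumes two common ones, fixed-size chunking and content-defined chunking. The window size, the boundary mask, the use of CRC-32 as the boundary hash, and all function names are illustrative assumptions, and the window hash is recomputed at each position rather than rolled, for brevity.

```python
import hashlib
import random
import zlib


def fixed_chunks(data: bytes, size: int = 64):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F):
    """Cut wherever the hash of the last `window` bytes matches a pattern.

    Boundaries depend only on nearby content, so they move together with
    the data when bytes are inserted earlier in the stream.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def reused_fraction(old_chunks, new_chunks) -> float:
    """Share of new chunks whose fingerprint already exists in the old set."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits / len(new_chunks)


# Pseudo-random test data with a fixed seed, then a one-byte insertion at the front.
rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(8192))
modified = b"!" + original

fixed_reuse = reused_fraction(fixed_chunks(original), fixed_chunks(modified))
cdc_reuse = reused_fraction(content_defined_chunks(original),
                            content_defined_chunks(modified))
print(f"fixed-size reuse: {fixed_reuse:.2f}, content-defined reuse: {cdc_reuse:.2f}")
```

In this setup the single inserted byte shifts every fixed-size chunk, while the content-defined boundaries realign after the first chunk, so far more chunks are recognized as duplicates. Real file formats can behave worse than this toy data, which is the point the abstract makes about many important file types.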

Block locality caching for data deduplication

Meister, Kaiser, Brinkmann

2013

Data deduplication systems discover and remove redundancies between data blocks by splitting the data stream into chunks and comparing a hash of each chunk with all previously stored hashes. Storing the corresponding chunk index on hard disks immediately limits the achievable throughput, as these devices are unable to support the high number of random IOs induced by this index. Several approaches to overcome this chunk lookup disk bottleneck have been proposed. Often, the approaches try to capture the locality information of a backup run and use this in the next backup run to predict future chunk requests. However, often this locality is only captured by a surrogate, e.g., the order of the chunks in containers [37]. Furthermore, some approaches degenerate slowly when the systems operate over months and years because the locality information becomes outdated. We propose a novel approach, called Block Locality Cache (BLC), that captures the previous backup run significantly better than existing approaches and always uses up-to-date locality information, which makes it less prone to aging. We evaluate the approach using a trace-based simulation of multiple real-world backup datasets. The simulation compares the Block Locality Cache with the approach of Zhu et al. [37] and provides a detailed analysis of the behavior and IO pattern. Furthermore, a prototype implementation is used to validate the simulation.
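The general idea of using the previous backup run to predict upcoming chunk requests can be sketched as a toy simulation. This is not the Block Locality Cache algorithm from the paper and not the approach of Zhu et al.; it only illustrates locality-based prefetching. The function name `count_index_lookups`, the `prefetch` parameter, and the unbounded cache are assumptions made for illustration.

```python
def count_index_lookups(previous_run, current_run, prefetch=8):
    """Count simulated on-disk chunk index lookups for current_run.

    previous_run and current_run are lists of chunk fingerprints in stream
    order. Every cache miss costs one index lookup; on a hit in the index,
    the fingerprints that followed this chunk in the previous run are
    loaded into the cache as a prediction of the next requests.
    """
    # Stand-in for the on-disk index: fingerprint -> position in the previous run.
    position = {fp: i for i, fp in enumerate(previous_run)}
    cache = set()  # unbounded for simplicity; a real cache would evict entries
    lookups = 0
    for fp in current_run:
        if fp in cache:
            continue  # predicted by an earlier prefetch, no index IO needed
        lookups += 1  # cache miss: one random IO against the index
        i = position.get(fp)
        if i is not None:
            cache.update(previous_run[i:i + prefetch])
    return lookups


# Example: the second run repeats the first with one new chunk in the middle.
previous = list(range(64))
current = previous[:32] + ["new"] + previous[32:]
print(count_index_lookups(previous, current, prefetch=8))
print(count_index_lookups(previous, current, prefetch=0))
```

With `prefetch=0` every chunk triggers an index lookup, while with prefetching most requests are served from the cache as long as the current run follows the order of the previous one.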

Design of an exact data deduplication cluster

Kaiser

et al. 2012