The design of a similarity based deduplication system

Aronovich, Lior; Asher, Ron; Bachmat, Eitan; Bitner, Haim; Hirsch, Michael; Klein, Shmuel T.

doi:10.1145/1534530.1534539

Cited by 82 publications

(45 citation statements)

References 22 publications


“…Each host maintains logs which contain hash values of blocks and periodically generate queries to share block hashes and remove unnecessary hashes. Similarity based deduplication system [14] uses file similarity measure and uses delta encoding to reduce redundancy between segments with high similarity. Since the system only compares similarity, only 4 GB of memory is used when storing 1 PB of data.…”

Section: Related Work (mentioning)

confidence: 99%

Energy Efficient Metadata Management for Cloud Storage System

Kim et al. 2015

International Journal of Distributed Sensor Networks


To effectively handle duplicate files, data deduplication schemes are widely used in many storage systems. Data deduplication algorithms reduce storage space by eliminating redundant data so that only a single instance of the data is stored on the storage device. In this paper, we propose an energy-efficient file synchronization scheme that provides hybrid data chunking using variable-length chunking (VLC) and fixed-length chunking (FLC). The main idea is to analyze similarities between old and new versions of data and decide which chunking method to apply in synchronizing the files. In particular, the proposed algorithm exploits the file similarity pattern for calculating the energy efficiency of chunking algorithms. We have developed an Android mobile application for file synchronization and measured energy consumption. The experimental results show that the proposed scheme helps save energy in synchronizing files, regardless of file types or the amount of redundancy the files have.


“…Unfortunately, for terabytes or petabytes of storage, the index is too large for memory and must be kept on disk, though several previous projects have used a full index for storing sketches [1,18,19,40]. As an example, for a production deduplicated storage system with 256 TB of capacity, 8 KB average chunk size, and 16 bytes per record, the sketch index would be a half-TB.…”

Section: Full Sketch Index (mentioning)

confidence: 99%
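The half-TB figure in the statement above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: binary units (1 TB taken as 2^40 bytes, 1 KB as 2^10 bytes) and exactly one index record per chunk. The function name `sketch_index_bytes` is hypothetical and does not come from the cited paper.

```python
def sketch_index_bytes(capacity_bytes: int, avg_chunk_bytes: int, record_bytes: int) -> int:
    """Estimate index size, assuming one fixed-size record per stored chunk."""
    num_chunks = capacity_bytes // avg_chunk_bytes
    return num_chunks * record_bytes


KIB = 2**10  # assumption: the quoted "KB" is treated as a binary unit
TIB = 2**40  # assumption: the quoted "TB" is treated as a binary unit

# Figures quoted in the citation statement: 256 TB capacity, 8 KB chunks, 16-byte records.
index_size = sketch_index_bytes(256 * TIB, 8 * KIB, 16)
print(index_size / TIB)  # fraction of a TB occupied by the index
```

Under these assumptions the system holds 2^35 chunks, and 2^35 records of 16 bytes each occupy 2^39 bytes, which is half of 2^40.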

“…Although n-level delta is possible for any value of n, decoding an n-level delta entails n reads of the appropriate base chunks, which can be inefficient in a storage system. For this reason, a delta storage system [1] may only support 1- or 2-level delta encodings to bound decode times. To compare the benefits of multi- and 1-level delta, we studied the compression differences.…”

Section: Multi- vs 1-level Delta (mentioning)

confidence: 99%
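The cost described in the statement above, where decoding an n-level delta needs n base reads, can be illustrated with a toy model. This is a minimal sketch and not the encoding used by any of the cited systems: the copy/insert delta format, the in-memory `store` dictionary, and the names `apply_delta` and `decode` are all assumptions made for illustration.

```python
def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild a chunk from its base using toy copy/insert operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:  # ("insert", literal_bytes)
            out += op[1]
    return bytes(out)


def decode(store, chunk_id):
    """Return (data, base_reads); each delta level costs one extra base read."""
    entry = store[chunk_id]
    if entry[0] == "raw":
        return entry[1], 0
    _, base_id, ops = entry
    base, reads = decode(store, base_id)  # the base must be decoded first
    return apply_delta(base, ops), reads + 1


# Hypothetical store: v1 is a 1-level delta against v0, v2 a 2-level delta via v1.
store = {
    "v0": ("raw", b"hello world"),
    "v1": ("delta", "v0", [("copy", 0, 6), ("insert", b"there")]),
    "v2": ("delta", "v1", [("copy", 0, 11), ("insert", b"!")]),
}

data, reads = decode(store, "v2")
print(data, reads)
```

In this model a system that caps the chain depth at 1 or 2 also caps `reads`, which is the bound on decode time that the statement refers to.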

“…We would like the pool of eligible data to include previous versions, maximizing our potential compression gains. A standard approach is to use a full index across the entire dataset, which requires space on disk, disk I/O, and ongoing updates [1,19]. An alternative is to use a partial index holding data that has recently been transferred, which removes the persistent structures but shrinks the pool of eligible data [35].…”

Section: Introduction (mentioning)

confidence: 99%

“…Second, our architecture only requires one index of fingerprints, while traditional similarity detection required one or more on-disk indexes for sketches [1,19] or used a partial index with a decrease in compression. Another important consideration in minimizing the number of indexes is that updating the index during file deletion is a complicated step, and reducing complexity/error cases is important for production systems.…”

Section: Introduction (mentioning)

confidence: 99%


WAN-optimized replication of backup datasets using stream-informed delta compression

Shilane, Huang, Wallace et al. 2012

ACM Trans. Storage


Replicating data off-site is critical for disaster recovery reasons, but the current approach of transferring tapes is cumbersome and error-prone. Replicating across a wide area network (WAN) is a promising alternative, but fast network connections are expensive or impractical in many remote locations, so improved compression is needed to make WAN replication truly practical. We present a new technique for replicating backup datasets across a WAN that not only eliminates duplicate regions of files (deduplication) but also compresses similar regions of files with delta compression, which is available as a feature of EMC Data Domain systems. Our main contribution is an architecture that adds stream-informed delta compression to already existing deduplication systems and eliminates the need for new, persistent indexes. Unlike techniques based on knowing a file's version or that use a memory cache, our approach achieves delta compression across all data replicated to a server at any time in the past. From a detailed analysis of datasets and hundreds of customers using our product, we achieve an additional 2X compression from delta compression beyond deduplication and local compression, which enables customers to replicate data that would otherwise fail to complete within their backup window.
