Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference 2009
DOI: 10.1145/1534530.1534539

The design of a similarity based deduplication system

Abstract: We describe some of the design choices that were made during the development of a fast, scalable, inline deduplication device. The system's design goals and how they were achieved are presented. This is the first deduplication device that uses similarity matching. The paper provides the following original research contributions: we show how similarity signatures can serve in a deduplication scheme; a novel type of similarity signatures is presented and its advantages in the context of deduplication requirement…

Citations: cited by 82 publications (45 citation statements)
References: 22 publications
“…Each host maintains logs which contain hash values of blocks and periodically generate queries to share block hashes and remove unnecessary hashes. Similarity based deduplication system [14] uses file similarity measure and uses delta encoding to reduce redundancy between segments with high similarity. Since the system only compares similarity, only 4 GB of memory is used when storing 1 PB of data.…”
Section: Related Work (mentioning)
confidence: 99%
“…Unfortunately, for terabytes or petabytes of storage, the index is too large for memory and must be kept on disk, though several previous projects have used a full index for storing sketches [1,18,19,40]. As an example, for a production deduplicated storage system with 256 TB of capacity, 8 KB average chunk size, and 16 bytes per record, the sketch index would be a half-TB.…”
Section: Full Sketch Index (mentioning)
confidence: 99%
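The half-TB figure in the statement above follows from simple arithmetic: the number of chunks is capacity divided by average chunk size, and the index holds one record per chunk. A minimal sketch of that calculation, assuming binary units (1 TB = 2^40 bytes, 1 KB = 2^10 bytes); the function name `sketch_index_bytes` is hypothetical and used only for illustration:

```python
# Worked arithmetic for the quoted example: 256 TB capacity,
# 8 KB average chunk size, 16 bytes per index record.
# Assumption: binary units (TB = 2**40 bytes, KB = 2**10 bytes).

KIB = 2**10
TIB = 2**40


def sketch_index_bytes(capacity_bytes: int, avg_chunk_bytes: int, record_bytes: int) -> int:
    """Return the size of a full index with one record per chunk."""
    chunk_count = capacity_bytes // avg_chunk_bytes
    return chunk_count * record_bytes


index_bytes = sketch_index_bytes(256 * TIB, 8 * KIB, 16)
print(index_bytes / TIB)  # fraction of a TB occupied by the index
```

Under these assumptions there are 2^35 chunks, so the index occupies 2^39 bytes, which is half a TB.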
“…Although n-level delta is possible for any value of n, decoding an n-level delta entails n reads of the appropriate base chunks, which can be inefficient in a storage system. For this reason, a delta storage system [1] may only support 1- or 2-level delta encodings to bound decode times. To compare the benefits of multi- and 1-level delta, we studied the compression differences.…”
Section: Multi- vs 1-level Delta (mentioning)
confidence: 99%
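The decode cost described in the statement above can be illustrated with a toy model. This is a sketch under stated assumptions, not the format of any cited system: the `Store` class, its methods, and the patch-list delta format (offset plus replacement bytes) are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple, Union

# Toy delta format (an assumption for illustration): a list of
# (offset, replacement bytes) patches applied over the base chunk.
Patch = Tuple[int, bytes]
Payload = Union[bytes, List[Patch]]


class Store:
    """Hypothetical chunk store that counts how many chunks are read."""

    def __init__(self) -> None:
        self.chunks: Dict[str, Tuple[Optional[str], Payload]] = {}
        self.reads = 0

    def put_full(self, chunk_id: str, data: bytes) -> None:
        self.chunks[chunk_id] = (None, data)

    def put_delta(self, chunk_id: str, base_id: str, patches: List[Patch]) -> None:
        self.chunks[chunk_id] = (base_id, patches)

    def read(self, chunk_id: str) -> Tuple[Optional[str], Payload]:
        self.reads += 1
        return self.chunks[chunk_id]


def decode(store: Store, chunk_id: str) -> bytes:
    """Rebuild a chunk by following its chain of bases down to a full chunk."""
    base_id, payload = store.read(chunk_id)
    if base_id is None:
        return payload  # full chunk, stored as-is
    data = bytearray(decode(store, base_id))
    for offset, replacement in payload:
        data[offset:offset + len(replacement)] = replacement
    return bytes(data)


store = Store()
store.put_full("c0", b"hello world")
store.put_delta("c1", "c0", [(0, b"J")])   # 1-level delta over c0
store.put_delta("c2", "c1", [(6, b"W")])   # 2-level delta over c1
result = decode(store, "c2")
print(result, store.reads)
```

In this model, decoding the 2-level delta `c2` reads the chunk itself plus two base chunks, so the number of base reads grows with the delta level, which is why bounding the level bounds decode time.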